\documentclass[a4paper,12pt]{report}
\usepackage{graphicx}
\usepackage{pdfpages}
\usepackage{soul}
\usepackage[latin1]{inputenc}
\usepackage{chronology}
\usepackage{array}
\usepackage{xcolor}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{epigraph}
\usepackage{url}
\usepackage{multirow}
\usepackage{graphics}
\usepackage{pdflscape}
\usepackage{lscape}
\usepackage{natbib}
\usepackage{pst-all}
\usepackage{gnuplottex}
\usepackage{com.braju.graphicalmodels}
\usepackage[left=1.5in,right=1.3in,top=1.1in,bottom=1.1in,includefoot,includehead,headheight=13.6pt]{geometry}
\usepackage[pdftex, plainpages = false, pdfpagelabels,
                pdfpagelayout = useoutlines,
                 bookmarks,
                 bookmarksopen = true,
                 bookmarksnumbered = true,
                 breaklinks = true,
                 linktocpage,
                 pagebackref,
                 colorlinks = true,
                 linkcolor = blue,
                 urlcolor  = blue,
                 citecolor = blue,
                 anchorcolor = green,
                 hyperindex = true,
                 hyperfigures]{hyperref}

\usepackage[english]{babel}
\usepackage[protrusion=true,expansion=true]{microtype}
\usepackage{amsmath,amsfonts,amsthm,amssymb}


% ------------------------------------------------------------------------------
% Definitions (do not change this)
% ------------------------------------------------------------------------------
\newcommand{\HRule}[1]{\rule{\linewidth}{#1}} 	% Horizontal rule

\makeatletter							% Title
\def\printtitle{%						
    {\centering \@title\par}}
\makeatother									

\makeatletter							% Author
\def\printauthor{%					
    {\centering \large \@author}}				
\makeatother							

% ------------------------------------------------------------------------------
% Metadata (Change this)
% ------------------------------------------------------------------------------
\title{	\normalsize \textsc{Final Draft} 	% Subtitle of the document
		 	\\[2.0cm]													% 2cm spacing
			\HRule{0.5pt} \\										% Upper rule
			\LARGE \textbf{\uppercase{The File-Drawer Problem}}	% Title
			\HRule{2pt} \\ [0.5cm]								% Lower rule + 0.5cm spacing
			\normalsize \today									% Todays date
		}

\author{
		Alexander L. Davis\\
		Carnegie Mellon University\\	
		Department of Social and Decision Sciences\\
        \texttt{alexander.l.davis1@gmail.com} \\
}


\raggedright
\parindent=1.5em

\begin{document}
% ------------------------------------------------------------------------------
% Maketitle
% ------------------------------------------------------------------------------
\thispagestyle{empty}				% Remove page numbering on this page

\printtitle									% Print the title data as defined above
  	\vfill
\printauthor								% Print the author data as defined above
% ------------------------------------------------------------------------------
\renewcommand{\thepage}{\roman{page}}
\label{TOC}
\tableofcontents
\listoftables
\listoffigures
\renewcommand{\thepage}{\arabic{page}}
\catcode`\@=11%
\psset{unit=14mm,arrowscale=1.5}
\SpecialCoor
\begin{abstract}
This dissertation provides normative, descriptive, and prescriptive analyses of a scientist's decision to share data.  The normative analysis (Chapter Two) concludes that, although there is no logical ground for determining whether data or theory is faulty when they conflict, data sharing policies that omit disconfirming data are unethical because they impose conventions on the reader, thus deceiving them.  However, five experiments (Chapter Four) find that surprising disconfirmations are perceived to be caused by error, and future observations that are seen as diffuse are judged to be less worthy of publication.  The second part of the normative analysis (Chapter Three) concludes that disconfirmations are more likely to be errors than affirmations only when the selection of true hypotheses is common.  However, participants in the Wason rule discovery task (Chapter Five), who were asked to discover the rule that generated a set of three numbers (2,4,6), thought the opposite.  With no penalty for incorrect error attributions, participants proposed triples that did not fit the rule (false hypotheses) more often than those that did fit the rule, but attributed error more often to disconfirmation than affirmation.  Furthermore, they shared data based on their attributions of error, and these error attributions were affected by whether feedback was affirming or disconfirming, even after controlling for whether the data were actually error.  The prescriptive analysis (Chapter Six) proposes methods of documenting data, methods, and statistical analyses so that penalties can be implemented when inferences are faulty or documentation is poor.  The dissertation concludes with a recapitulation of the normative, descriptive, and prescriptive analyses and highlights directions for future work. 
\end{abstract}

\part{The Problem of Data-Sharing}

\chapter{A Short History of Data Sharing}
\setlength{\epigraphrule}{0pt}
\setlength{\epigraphwidth}{.95\textwidth}
\begin{epigraphs}
\centering
\qitem{The first principle is that you must not fool yourself---and you are the easiest person to fool.  So you have to be very careful about that.  After you've not fooled yourself, it's easy to not fool other scientists.  You just have to be honest in a conventional way after that.} 
{---\textsc{Richard Feynman, 1974 \cite{feynman1974cargo}}}

\qitem{In a desert prison, an older prisoner befriends a new arrival. The young prisoner talks constantly about escape, spinning plan after plan. After a few months, he makes a break. He's gone a week; then the guards drag him back. He's half dead, crazy with hunger and thirst. He wails how awful it was to the old prisoner: endless stretches of sand, no oasis, failure at every turn. The old prisoner listens for a while, then says, ``Yep. I know. I tried those escape plans myself, 20 years ago.'' The young prisoner says, ``You did? Why didn't you tell me?'' The old prisoner shrugs: ``So who publishes negative results?''}
{---\textsc{Janice Probst, 2006 \cite{probst2006prisoners}}}

\qitem{It can be proven that most claimed research findings are false.}
{---\textsc{John Ioannidis, 2005 \cite{ioannidis2005most}}}

\end{epigraphs}

In research as well as life, I make mistakes.  I make a lot of them.  I form hypotheses poorly, forget to measure age or gender, or create an instrument with very little construct validity.  When an experiment doesn't come out the way I expect, these mistakes seem apparent.  When the data come out the way I like, it's difficult to see any flaw.

I don't want to bother others with my bad research, nor do they want to hear it.  Pressures to feel both competent about myself and be evaluated positively by others push me toward hiding the data I see as flawed.  Yet, when the results come out the way I want, I cannot convince myself that the `bad research' that preceded the supposed discovery is irrelevant.  This is the \emph{file-drawer problem}, where each scientist must decide whether to share unwanted, disconfirming evidence with the scientific community.\footnote{The dissertation is concerned with the inferences and behavior of the scientist who collects the data and must decide to submit it for publication, rather than an editor who decides to publish.  Problems related to data sharing have been called a number of things.  One is the file-drawer problem \cite{rosenthal1979file,rosenthal1988selection}.  Another is selective reporting bias \cite{guyatt2011grade,tse2009reporting}.  A third is the problem of disclosure \cite{hirshleifer2003limited}.  A fourth is publication bias \cite{guyatt2011grade}.  A fifth is data availability \cite{steinbrook2010data}.  There are probably many more.  The term selective reporting bias, where the researcher chooses to exclude data in a submitted article for publication, is most apt.  The terms file-drawer problem and publication bias are ambiguously applied to both the journal and researcher decisions, although they may be narrowly defined to mean the former.  Use of the term file-drawer problem should be interpreted as selective reporting bias in this dissertation.}

The shame, embarrassment, and disrepute associated with `flawed results' have existed since the inception of institutional science.  In the 18th century, measurement error and personal flaw were synonymous, as the ``concealment of discrepancies in observation were not only common, they were considered a savant's prerogative. It was error that was seen as a moral failing'' \cite{devellis2011scale}.  Little has changed since then; in fact, the file-drawer problem seems to be getting worse.  In 1990, 70\% of results published in a sample of journals across a variety of scientific disciplines were statistically significant \cite{fanelli2012negative}.  This number increased to 86\% in 2007, and is now over 90\% in the social sciences.  Judging from what is published, in the last 300 years scientists have either become flawless researchers or adept at hiding their flaws.

Some have studied this process empirically.  For example, Sterling \cite{sterling1959publication} was one of the first to try to empirically verify the file-drawer problem in Psychology.  He found that almost all (286/294 or 97\%) reports that used significance testing from four prominent psychology journals had `statistically significant' results.  While the published data may be accurate, the almost non-existence of published null results suggests not all data are shared.
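
One way to see why a figure like 97\% suggests suppression is a back-of-the-envelope calculation.  The symbols here are illustrative assumptions, not quantities reported by Sterling: let $\pi$ be the proportion of tested hypotheses that are true, $\alpha$ the significance level, and $1-\beta$ the statistical power.  If every study were reported, the expected proportion of significant results would be
\[
  P(\text{significant}) = \pi(1-\beta) + (1-\pi)\,\alpha .
\]
Because this is a weighted average of $1-\beta$ and $\alpha$, it can never exceed the larger of the two.  For example, with $\alpha = 0.05$ and a hypothetical power of $0.5$, even a field in which every tested hypothesis is true ($\pi = 1$) would report only 50\% significant results; reaching 97\% would require power of at least $0.97$ in nearly every study.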

Following this, Mahoney \cite{mahoney1977publication} constructed a fake paper on the efficacy of using reinforcement on children to modify their behavior, experimentally varying the results of the intervention.  He sent the different versions of the paper to 75 reviewers of a journal (the \emph{Journal of Applied Behavior Analysis}) that he knew would strongly favor efficacy of the treatment.  When the results were what the reviewers (and their subfield) wanted to hear, they were more likely to recommend publication, and rated the paper as having higher methodological quality.  When the data were not what they wanted to hear, they scrutinized it much more closely, and were three times as likely to find an unplanned typographical error in the manuscript.  

Although omitting negative results from publication seems to be the \emph{de facto} policy, there are those who have argued against it. For decades, Cohen \cite{cohen1992power,cohen1962statistical} and colleagues \cite{sedlmeier1989studies,rossi1990statistical} have lamented the low statistical power of psychological experiments, and argued that this implies that data are suppressed from publication.  Until recently, psychologists have not been concerned.  The watershed moment came when Daryl Bem published a paper in the prestigious \emph{Journal of Personality and Social Psychology} providing an experimental demonstration of extrasensory perception \cite{bem2011feeling}.  A flurry of criticism followed, focusing on statistical analysis, peer review, and publishing in Psychology \cite{wagenmakers2011psychologists}.  Following this, Simmons \emph{et al}. \cite{simmons2011false} published a paper accusing psychologists of a culture of unethical data analysis and sharing practices.

Since then, these unethical practices have been exposed several times, including the cases of Marc Hauser, Diederik Stapel, and Dirk Smeesters.  Failure to publish replications of Bem's original paper, most of which did not show the same effect, also spawned new `file-drawer' websites where replications (especially failed ones) can be archived.  Complementing this, several new projects have focused on independently replicating published psychology results \cite{yong2012bad,demets1991data,reynolds2004ori}.  However, these file-drawer websites have so far had little success \cite{heger2012clinical}, and failed independent replications continue to ``go unpublished, languishing in personal file drawers or circulating in conversations around the water cooler'' \cite{yong2012replication}.

The file-drawer problem is not limited to social science research.  Medical research companies and medical schools have a strong financial incentive to make life-saving discoveries, while suppressing research that suggests their discoveries are false.  As in Psychology, traditional publication bias approaches have found that most published medical research is confirmatory or statistically significant \cite{dickersin1993publication}.  For example, Hasenboehler \emph{et al}. \cite{hasenboehler2007bias} found 74\% of studies in orthopedic and general surgery reported positive findings, with another 9\% reporting ambiguous or neutral findings.

What is published does not accurately reflect the research that is done.  Some evidence supporting this comes from asking scientists what they do with their data.  For example, Martinson, Anderson, and De Vries \cite{martinson2005scientists} surveyed 3,247 NIH-funded scientists and found that overall, 0.3\% admitted to falsifying data, 6\% reported not presenting contradictory data, 10.8\% withheld methodological details and results in published papers, and 15.3\% reported dropping observations based on the `gut' judgment that they were in error.

Other evidence comes from comparing published reports to other sources.  For example, clinical trial registries indicate that one-third of trials still remain unpublished three years after conclusion \cite{lehman2012missing}.  Many clinical studies submitted to IRBs produce negative results and are not published \cite{stern1997publication}.  AIDS trials with negative results take about twice as long to publish as those with statistically significant results \cite{ioannidis1998effect}.  Anti-depressant trials submitted to the FDA do not match those published, inflating the effect size in the published literature by about one-third \cite{turner2008selective}.  Three nicotine treatment trials by Pharmacia went unpublished, but the trial of the successful successor treatment was published in the \emph{Journal of the American Medical Association} \cite{vergano2001filed}.  The list goes on, including the anti-depressant paroxetine for children \cite{sussman2004file}, lorcainide for myocardial infarction \cite{yamey1999scientists}, reboxetine \cite{godlee2010missing}, Vioxx \cite{madigan2012underreporting}, and several anti-smoking therapies (naltrexone, mecamylamine, and Habitrol).

Even Physics, the paragon of science, has a history of file-drawer problems.  For example, 40\% of results in one issue of \emph{Review of Particle Physics} were omitted because of ``strong sources of bias'', ``assumptions that the Particle Data Group does not wish to incorporate'' or ``inconsistency with other reported results'' \cite{hedges1987hard}.  In attempting to measure the charge of the electron, Millikan collected data from 140 oil drops but reported only 58, using his own judgment to determine which data were valid and which were invalid \cite{franklin1997millikan}.  Just like psychologists and biomedical researchers, physicists are ``always doing experiments or making observations that disappoint them. They look for some phenomenon or relationship and they do not find it.  Most of these negative experiments are forgotten and the results consigned to the file drawer'' \cite{collins2003lead}.

\section{Proposed Causes}

The two strongest explanations for the file-drawer problem are perverse incentives and error attributions.  Perverse incentives are institutional rewards for reporting only successful findings to others.  Error attributions are cognitive tendencies to see disconfirming evidence as flawed and affirming evidence as flawless.

\subsection{Perverse Incentives}

Simple, ``eye-catching'', and easy-to-comprehend stories yield publications \cite{freedman2010lies}.  These publications, in turn, reward researchers with jobs and funding for further research \cite{mahoney1977publication}.  For example, the American Medical Association reports that more than 70\% of the funding for pharmaceutical research comes from industry \cite{economist}.  This funding often depends on the researcher's ability to show that they can get positive results \cite{freedman2010lies}.

On the other hand, publishing results that contradict a flashy hypothesis can mean sacrificing one's career, or even intimidation from those who would rather not see the results published \cite{godlee2012research}.  Any result that suggests a therapy is ineffective puts that company's potential profits at odds with the public's well-being \cite{karlawish2004silence}.  These companies generally distort reports of adverse events, make them difficult to understand, or do not even measure or report them at all \cite{ioannidis2009adverse}.  The pressure to produce positive results comes not only from the medical research companies, but also from researchers who want to save lives, and patients who want to live.  Unfortunately, it is difficult to produce uniformly positive experimental evidence for any theory, even if it is right.  When mixed or disconfirming results occur, it is ``tempting for investigators to submit selected data sets for publication, or even to massage data to fit the underlying hypothesis'' \cite{begley2012drug}.  

Physicists who study gravitational waves are acutely aware of the challenges of institutional incentives to produce flashy results.  Researchers at the Laser Interferometer Gravitational-Wave Observatory (LIGO) try to detect the presence of gravitational waves, but these waves are so small that they still haven't been detected in over 40 years of searching and with multi-billion dollar research budgets.  Because these physicists never get to produce a discovery, they have difficulty convincing others of the value of their work.  For example, one gravitational wave physicist consistently had his students criticized for not having made a discovery, making it difficult for them to graduate, get jobs, or earn tenure:

\begin{quote}
And the reason for why it became big, at [my institution], and in my head, was fundamentally because of an incident that happened with two students who had done a beautiful job and they got this shit from my own colleagues. And I said I'm never gonna put students through this again.  If we are going to continue with this, we are going to have to do it on a scale such that even if we don't see anything, no goddam [expletives deleted] theorist, OK, can confront one of my students and say ``What did you discover?'' and give him a sneering, [expletive deleted], ride, OK? So it has to be something where the upper limit is good enough. And you say ``Yeah we have made a scientific statement.'' (pg. 668) \cite{collins2003lead}
\end{quote}

\subsection{Error Attributions}

In any groundbreaking experiment, the difficulties of measurement are typically so extreme that any failed prediction could be attributed to a number of flaws, including bad design, an underpowered study, incorrect analyses, or chance \cite{anestis}.  In these circumstances disconfirming results are both likely to occur and reasonably attributed to error.  Thus, any data sharing policy that omits results that are attributed to error will lead to the file-drawer problem.

Take the famous Michelson-Morley experiment as an example.  This experiment sought to measure the velocity of the earth through the aether, an invisible substance that all matter was hypothesized to be suspended within.  Measurements of the aether worked much like measurements of the velocity of a car by putting one's hand outside the window to measure the force of the wind.  Just as one's hand experiences resistance from the air outside the car window, the earth was expected to experience resistance from the aether.  However, the theoretical effect of this `aether wind' was expected to be so small that measurement instruments needed to be extremely sensitive; so sensitive that ``a mass of 30 grams placed on the end of one of the arms of an apparatus weighing tons was enough to upset the results dramatically'' (pg. 34) \cite{collins1998golem}.  Sensitivity to vibration was only one among many possible measurement errors, including changes of temperature ``as small as 1/100 of a degree'' that would theoretically triple the effect of the aether wind.  Building the apparatus out of metal would have added weight and reduced vibration, but metal was ruled out because of its sensitivity to magnetic fields.  Nor could the device be made out of wood, because of its sensitivity to humidity.

In this ocean of possible measurement flaws, the eventual failure to detect the aether wind was disappointing but unsurprising.  All of the experiments conducted by Michelson and Morley, and subsequently by Morley and Miller, were null results, regarded by their creators as failures, and were not followed up by Michelson, as he was ``so disappointed at the result that instead of continuing he immediately set about working on a different problem: the use of the wavelength of light as an absolute measure of length'' \cite{collins1998golem}.  In their time, the failure to measure the aether wind was an anomaly, and was explained as being caused by some combination of the many possible experimental flaws previously mentioned.  However, because the null result was retrospectively consistent with Einstein's Special Theory of Relativity, the Michelson-Morley experiments are no longer seen as flawed; they are considered among the greatest physics experiments ever conducted.

Another critical test of Einstein's relativity theory had similar problems of measurement error.  The solar eclipse of 1919 allowed Einstein's theory to be compared against Newton's theory of gravitation.  In Newton's theory, gravity should bend light during the eclipse.  However, Einstein's theory proposed additional bending due to the curvature of spacetime.
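
The size of the disagreement can be made concrete.  As a sketch using the standard textbook expressions (the symbols are introduced here for illustration), a light ray passing a body of mass $M$ at distance $R$ from its center is deflected by an angle
\begin{align*}
  \delta_{\text{Newton}} &= \frac{2GM}{c^{2}R}, &
  \delta_{\text{Einstein}} &= \frac{4GM}{c^{2}R},
\end{align*}
where $G$ is the gravitational constant and $c$ is the speed of light.  For a ray grazing the edge of the sun, these come to roughly $0.87$ and $1.75$ seconds of arc, so the two theories differ by a factor of two in a quantity that is itself smaller than two arcseconds.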

Sir Arthur Eddington led the main expedition to measure these light deflections during the eclipse (at Principe, off the coast of Africa), while other research groups took measurements simultaneously (at Sobral in Brazil).  The measurements of these different groups did not agree.  As with the Michelson-Morley experiments, the measurement of the light deflections was extremely difficult, where ``the difference in focal length between a hot and a cold telescope will disturb the apparent position of the stars to a degree which is comparable with the effect that is to be measured'' (pg. 46-47) \cite{collins1998golem}.  Most of the measurements that were inconsistent with Einstein's theory were considered `noisy' (e.g., the Sobral astrographic plates), and removed from the data analysis.  Interestingly, like Millikan in his oil drop experiments, Eddington attributed the results from the Sobral astrographic plates to systematic error, often without being able to explain why (pg. 51) \cite{collins1998golem}.  Like Millikan, Eddington was right.

Attributions of `noisy' results to measurement error are not always right, however.  Chemist and Nobel Laureate Irving Langmuir personally demonstrated this.  Two chemists, excited about an apparent discovery, asked him to evaluate an effect (the Davis-Barnes effect) that relied on an observer counting flashes through a tube.  After they demonstrated the effect to him, Langmuir concluded that they hadn't made a discovery, but were instead biased by their expectations.  To prove this, Langmuir secretly changed the pattern of voltages used in the experiment, without notifying the experimenter (Barnes).  Barnes, himself an esteemed professor at Columbia University, counted flashes in a pattern that was unrelated to the voltage changes, the supposed cause.  When Langmuir confronted Barnes about this clear refutation, Barnes immediately generated ad hoc explanations, that ``the tube was gassy'' and the ``temperature has changed.''  Langmuir called this response \emph{pathological science}, where Barnes ``immediately---without giving any thought to it had an excuse.  He had a reason for not paying any attention to any wrong results. It just was built into him. He just had worked that way all along and always would. There is no question that he is honest; he believed these things, absolutely'' \cite{langmuir1953pathological}.  Langmuir proceeded to write a detailed letter to Barnes, arguing that he was ``counting hallucinations.''  Like Feynman, Langmuir believed that ``men, perfectly honest, enthusiastic over their work, can so completely fool themselves.''

\section{Possible Effects}

The file-drawer problem is likely to have a slow but severe effect on a scientific field, eventually causing it to become completely hobbled or extinct, as was probably the case with Soviet Lysenkoism.  This happens for a number of reasons.  Flawed methods cannot be used to build better ones, instead making every researcher start from scratch \cite{liddle}.  Flashy but wrong theories will be perpetually proposed and the refutations will be perpetually buried, leaving the field in a conceptual stasis \cite{mccormick2007positive}.  Negative results will continue to accumulate if not reported, as researchers  ``who are unaware of the contradictory experimental results repeatedly attempt to confirm or disprove the selected results in the literature'' \cite{rockwell2006publishing}.  Funding cuts will occur when discoveries are repeatedly overturned and replicable results do not surface, as happened with Title VII grants for physician education \cite{probst2006prisoners}.  Honest students will quit, as observed by Wagenmakers, ``I've seen students spending their entire PhD period trying to replicate a phenomenon, failing, and quitting academia because they had nothing to show for their time'' \cite{yong2012replication}.  Those who can get positive results, by ethical or unethical methods, will remain.

The file-drawer problem is particularly harmful in medical research.  Treatments that seem initially useful must be abandoned, wasting time and money and putting patients at risk, as in the case of zidovudine \cite{ioannidis1998effect}.  Effect sizes are likely to shrink drastically upon replication or uncovering of the file-drawer, with the potential to make subsequent research based on initially flawed results completely unusable \cite{dirnagl2010fighting}.  The file-drawer problem corrupts meta-analyses needed for evidence-based medical decision-making \cite{hasenboehler2007bias}. This threatens patient safety, and potentially imposes high opportunity costs, as funding is diverted away from the search for real cures to ineffective treatments.  Khan, Khan, and Brown \cite{khan2002placebo} argue that unreported results can make ineffective drugs look effective.  If these drugs are then used as active controls in future studies (i.e., non-inferiority studies) in place of placebo controls, future drugs that are roughly equivalent to the active control (which actually has no effect) will also seem to be effective.  Because of the file-drawer problem, they strongly argue against eliminating the use of placebo controls.
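
The logic of this argument can be sketched with a small worked example.  The notation is an illustrative assumption rather than Khan, Khan, and Brown's own: let $\theta_A$ and $\theta_B$ be the true effects of the active control $A$ and a new drug $B$ relative to placebo, let $\hat{\theta}_A$ be the published estimate for $A$, and let $\Delta > 0$ be the non-inferiority margin.  A non-inferiority trial declares $B$ acceptable when
\[
  \theta_B - \theta_A > -\Delta ,
\]
and readers then infer that $B$ beats placebo by at least about $\hat{\theta}_A - \Delta$, which is positive whenever the margin is chosen smaller than the published effect.  If the file drawer has inflated $\hat{\theta}_A$ above zero while in truth $\theta_A = 0$, then a new drug with $\theta_B = 0$ satisfies the criterion, because $0 - 0 > -\Delta$, and it inherits an apparent benefit that neither drug actually has.  A placebo arm would estimate $\theta_B$ directly and expose the problem.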

The file-drawer problem is also likely one of the main causes of the poor replicability of published drug trial results.\footnote{This does not apply to drug approval, as the FDA requires pre-approval of all trials that will eventually be submitted as evidence.}  Successful replications of research in haematology and oncology have been rare: in one systematic attempt, only 11\% (6/53) of replications were successful \cite{begley2012drug}.  Replications of antidepressant trials fared better, with 48\% (45/93) being significantly better than placebo \cite{khan2002placebo}.  Part of this failure might be the fact that ``there are no guidelines that require all data sets to be reported in a paper; often, original data are removed during the peer review and publication process'' \cite{begley2012drug}.

\section{Overview of the Dissertation}

Researchers and consumers of research are concerned by their experience with the publication system.  This dissertation complements empirical research on the file-drawer problem that addresses its prevalence and causes \cite{fanelli2010positive,fanelli2012negative,fanelli2010pressures,sterling1959publication}, extends previous philosophical \cite{popper2002logic,poincare1905science,lakatos1980methodology} and mathematical analyses \cite{ioannidis2005most,shaftoepistemic}, and builds on prescriptions based both on method and documentation \cite{stodden2012reproducible,stodden2009enabling,king2007introduction,leisch2002sweave}.

With this in mind, the dissertation explores the following two questions:

\begin{itemize}
\item What data sharing policies emerge in the face of unexpected and unwanted data?  
\item Is reasoning about the validity of data distorted by incentives?  
\end{itemize}

To do this, I use the Normative-Descriptive-Prescriptive (NDP) framework of Behavioral Decision Research \cite{von1986decision,fischhoff2010judgment,bell1988descriptive}: 
\begin{itemize}
\item  \emph{Normative Analysis}: In Part Two of the dissertation, the normative analysis examines how idealized, rational agents should reason and act when sharing data.
  \begin{itemize}
    \item Chapter Two argues that data sharing is an ethical rather than epistemological problem.
      \item Chapter Three provides a normative analysis of data sharing policies using probability calculus.
\end{itemize}
\item \emph{Descriptive Analysis}: In Part Three of the dissertation, the descriptive analysis builds on the normative analysis, trying to understand how humans actually reason and act compared to the ideal.  The descriptive analysis uses two sets of experiments.  
  \begin{itemize}
    \item  Chapter Four presents experiments where participants are asked to attribute the cause of unexpected results in a hypothetical psychology experiment. 
      \item Chapter Five presents experiments where participants discover a rule when there is the possibility of error and there are incentives to share or hide data from others.  
  \end{itemize}  
\item \emph{Prescriptive Analysis}: In Part Four of the dissertation, the prescriptive analysis asks how real people can be brought closer to the ideal.  The prescriptive analysis proposes methods of documenting and communicating uncertainty from scientific experiments.
  \begin{itemize}
  \item Chapter Six presents a simple and open approach to documenting and sharing data. 
  \end{itemize}
\end{itemize}

\part{Normative}
\section*{Introduction to the Normative Analysis}

Part Two of this dissertation discusses when selective reporting, where data are omitted from publication if they conflict with institutional incentives or are attributed to error, is normatively justified.  Chapter Two evaluates data sharing from the viewpoint of philosophy of science and philosophy of statistics, and proposes an ethical standard for data sharing.  Chapter Three evaluates two justifications for not sharing disconfirming data: 1) that disconfirmations are not as informative as affirmations, and 2) that disconfirmations are more likely to be error.  

\chapter{Cleaning the Data or Cooking the Books}

\section{Introduction}

Some scientists are distressed by the lack of rules for data sharing, worsened by the \emph{de facto} action prescribed by their paradigms, where inconvenient data are routinely discarded \cite{spellman2012introduction}.  This distress is warranted.  Fanelli \cite{fanelli2012negative} found that over the last 20 years negative results have begun to disappear from scientific journals, and that this is worse for the social sciences.  Ioannidis \cite{ioannidis2005most} claims that most published research findings in medicine are provably false.  Wasserman \cite{wasserman2012world} even proposed completely eliminating peer review, opting for a more democratic collaborative review system based on pre-prints (\url{http://arxiv.org/}).
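
Ioannidis's claim rests on a short probability argument, which can be sketched as follows (the numerical values below are hypothetical and chosen only for illustration).  Let $R$ be the pre-study odds that a tested relationship is true, $\alpha$ the significance level, and $1-\beta$ the statistical power.  Ignoring bias, the probability that a statistically significant finding is true, the positive predictive value, is
\[
  \text{PPV} = \frac{(1-\beta)\,R}{R - \beta R + \alpha} ,
\]
so a significant finding is more likely true than false only when $(1-\beta)R > \alpha$.  For instance, with $\alpha = 0.05$, power $1-\beta = 0.2$, and pre-study odds $R = 0.1$, the formula gives $\text{PPV} = 0.02/0.07 \approx 0.29$: fewer than one in three significant findings would be true, before any selective reporting is taken into account.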

This alarming situation is not helped by poorly articulated rules about data sharing provided by scientific bodies like the National Science Foundation or the National Academy of Sciences.  Rather than addressing the real problems scientists face when deciding whether or not to share messy data, these bodies limit their discussions of misconduct to ``actions that are unambiguous, easily documented, and deserving of stern sanctions'' \cite{de2006normal}; that is, Fabrication, Falsification, and Plagiarism.

Instead, researchers must decide whether removing data from a published report would constitute ``cleaning the data or cooking the books'' \cite{de2006normal}.  These decisions are common, ambiguous, and have important consequences.  Poor data sharing policies undermine the effective communication of scientific research.  Suppressing unwanted data, for example data that contradict a favored theory, wastes the resources of those who try to build on the published results.  However, sharing too much data can confuse our peers with an incomprehensible and unusable morass of `mere facts' \cite{kuhn1996structure}, while sharing faulty data can lead others to draw incorrect conclusions.

In this chapter I argue that sharing data is an ethical decision, where one chooses to minimize the chance of deceiving one's reader.  Any policy for determining when data are faulty and should be excluded from communication is based on \emph{convention}, and scientists may reasonably disagree on the conventions they find acceptable.  Methods of cleaning data that irrevocably impose conventions on the reader make the `cleaned' results deceptive and thus unethical.  

\section{Unwanted data}

When shared with others, data that affirm a scientist's theory are likely to yield rewards, whereas data that suggest this theory is false will not.  Although ideally the scientific community benefits the most from data that make the evaluation of theories clear, affirming or not, in practice refutations are not perceived as providing this clarity.  As Lakatos argued, ``there is no falsification before the emergence of a better theory'' \cite{lakatos1980methodology}, which, if taken seriously, means that refutations are valueless unless accompanied by affirmation.  Thus, scientists set perverse incentives for themselves and each other, where only convincing data are rewarded.  In turn, they are provided only that, in the form of affirming results.  Because refutations are valueless (or even of negative value), the interests of the scientific community and individual researcher are often at odds.

Take for example a researcher who proposes a specific hypothesis about behavior, adhering to the assumptions of the members of her community (e.g., her lab, her advisor).  This may be that children perform Bayesian causal induction, that unconscious priming can be linked to sensory perception, or that prejudice can arise from arbitrary group distinctions.  If the experimenter's specific hypothesis, derived from the paradigm of the community, is not supported by the data, it also casts doubt on her abilities.  Her lab members may suggest her experiment was poorly conducted or her calculations were incorrect, giving the researcher a bad reputation.  As Kuhn and others have argued, once a paradigm is established and expectations are entrenched, failing to get an expected result is a ``failure of the scientist'' (pg. 35) \cite{kuhn1996structure}, and ``discredits only the scientist and not the theory'' (pg. 80).  Thus, a report of any result other than one that affirms the paradigm is merely a statement about the poor quality of the researcher.  In contrast, if her hypothesis is supported, she is encouraged to publish the paper, and rewarded with future employment and funding.

In such situations, data sharing is a signaling game.  In the signaling game, the researcher has private information (the data she has collected) about her hypothesis, and the community wants to reward researchers if their hypothesis is interesting and the data support it, and not otherwise \cite{spence1973job}.  The obvious solution is for the community to look at the data the researcher shares (the signal), hoping it will clarify the value of the researcher's hypothesis.  However, if the cost to the researcher of omitting falsifying data is low enough, then she can always create a flawless picture of her theory with data, guaranteeing reward.  In this case, the signal (data) is meaningless, and the community would be reasonable to ignore any data provided by the researcher.  Since neither the researcher nor the community knows whether the researcher's hypothesis is correct (although the researcher has suggestive evidence), the community is forced to evaluate her hypothesis purely on a priori grounds, such as how interesting or surprising the theory is.
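
The collapse of the signal can be made explicit with a minimal sketch.  The notation is an assumption introduced here for illustration and is not taken from the signaling literature cited above; it also simplifies by treating the hypothesis as either true ($T$) or false ($F$).  Let $\pi$ be the community's prior probability that the hypothesis is true, let $R$ be the reward for reporting affirming data ($s = a$), and let $c$ be the researcher's cost of omitting falsifying data.  If $c < R$, a researcher whose hypothesis is false still reports only affirming data, so that $P(s = a \mid T) = P(s = a \mid F) = 1$ and
\begin{equation*}
P(T \mid s = a) = \frac{P(s = a \mid T)\,\pi}{P(s = a \mid T)\,\pi + P(s = a \mid F)\,(1 - \pi)} = \frac{\pi}{\pi + (1 - \pi)} = \pi .
\end{equation*}
Under these assumptions the community's belief after seeing the data equals its prior, which is the sense in which the signal is meaningless.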

Even altruistic researchers, who are willing to be surprised and make discoveries, are affected by these perverse incentives.  This is because they may only be able to get a hearing for their work if they achieve and maintain status as a respected scientist.  In the archetypal communities described by Kuhn, that would mean avoiding results and discoveries that challenged the paradigm.  In signaling game terms, if the scientific community does not value anomalous data, it would be in the interest of scientists who genuinely care about discovery to not share them.  In sum, scientists are incentivized to not share disconfirming data.  As long as these perverse incentives exist, ethical researchers are likely to suffer.

\section{Incomprehensible data}

If sharing data were purely about incentives, then changing incentives could possibly provide an easy solution to problems of publication bias.  However, the incentive dilemma is hidden in an epistemic one.  Data that do not affirm a theory are also more difficult to make sense of than those that demonstrate the expected.  

Difficulties in determining the meaning of data are severe and persistent problems that are often at the heart of scientific and philosophical debates.  Philosophers of science and statistics, from Laplace \cite{laplace1806marquis} to Gelman and Shalizi \cite{gelman2010philosophy}, have grappled unsuccessfully with making sense of anomalous data.  To see the ethical nature of data sharing, the epistemic veil surrounding data sharing decisions must be lifted.

\subsection{Kuhn's Paradigms}

The interpretability of data, or lack thereof, is determined by paradigm.  The paradigm prescribes what to look at, how to take measurements, and resolves many other methodological and theoretical choices (sometimes called the ``frame problem'' \cite{pylyshyn1987robot}).  Experiments that result in a known outcome within a paradigm are the only ones that scientists can unambiguously make sense of, as the paradigm serves to prepare the mind of the researcher to spot the type of phenomena the paradigm prescribes, while blinding the researcher to other types. 

When there is no paradigm, useful data are hard to spot.  Different scientists will observe ``the same range of phenomena [and] describe and interpret them in different ways'' (pg. 17) \cite{kuhn1996structure}.  Fact-gathering is a relatively random and undirected process.   

When there is a paradigm, the meaning of data is still problematic.  Observations that do not fit into the paradigm are not necessarily reshaped to fit what was expected.  Instead they may not even be seen, as data are partial records that are ``immensely circumstantial'' (pg. 16) \cite{kuhn1996structure}.  As a result the details needed to recognize an anomaly may not even be recorded.  That is, a paradigm can preclude even seeing potentially anomalous data.

The data may not make sense, even if there is a paradigm and the data are seen.  Without an ability to generate causal explanations, data are ``unrelated and unrelatable'' mere facts, or even worse ``not quite a scientific fact at all'' \cite{kuhn1996structure}.  Instead, paradigmatic explanations that have previously succeeded will be invoked to explain anomalous data.  For example, Joseph Priestley did not discover oxygen, but instead another instance of his theory: ``air with less than its usual quantity of phlogiston'' (pg. 53) \cite{kuhn1996structure}.

Once data are recognized as anomalous they must be reconciled with theory.  This can happen either because the theory is wrong, the data are flawed, or both.  Any conflict between theory and data is really a conflict between an explanatory theory of the causal mechanisms of interest and an interpretive theory of the meaning of the data \cite{lakatos1980methodology}.  At the most basic level, data are composed of observations that assume one's visual and cognitive perceptions are accurate.  At higher levels, entire theories of observation and instrumentation are used.  For example, Galileo's observations of ``mountains on the moon and spots on the sun'' were aided by a telescope, and thus the ``optical theory of the telescope'' (pg. 15) \cite{lakatos1980methodology}.  As a result, when observation and theory do not align, there is not a conflict between fact and theory, but between two theories.

Data sharing policies are ambiguous because there is no basis for privileging statements based on an observational theory, such as the optical theory of the telescope, over other ``non-observational'' theories, such as Newton's laws.  No appeal to psychology could solve this problem, because the psychology of observation is itself dependent on an observational theory.  All observations are theory-laden, and empiricism must assume a ``psychology of observation'' that is hoped to be accurate \cite{lakatos1980methodology}.  There is no logical way to determine whether theory or data were wrong, meaning there is no purely logical guide to data sharing.  

Instead, when theory and observation conflict, resolution must occur on extra-logical grounds; by ineffable tacit knowledge \cite{polanyi1998personal}, severe tests of alternative explanations for the data \cite{mayo1996error}, what has traditionally been done \cite{kuhn1996structure}, the sagacity \cite{duhem1991aim} or intuition \cite{popper2002logic} of the scientist, or what provides the most convenient and easy-to-work-with explanation \cite{poincare1905science}.  A decision must be made about whether to retain theory or observation \cite{popper2002logic}.

\subsection{Poincare's Conventionalism}

The dominant approach to dealing with the conflict between observational and non-observational theories is \emph{conventionalism}.  In contrast to justificationism, which permits only logically valid arguments, conventionalism allows some knowledge to progress by decision, without having to provide reason.  Conventionalism chooses the meaning of data based on convenience; what is easiest to work with and understand.

Take for example Henri Poincare's \cite{poincare1905science} discussion of choosing between Euclidean or non-Euclidean geometries.  According to Poincare, if one discovered ``negative parallaxes'' or proved that ``all parallaxes are higher than a certain limit'', this would not disprove Euclidean geometry, because we could also ``modify the laws of optics, and suppose that light is not rigorously propagated in a straight line.''  This conclusion is reasonable, and possibly true, but not provable without also using geometric assumptions.  Instead, the decision to throw out the theory or the data depends on which is ``more advantageous'', or convenient, and in this way ``Euclidean geometry has nothing to fear from fresh experiments'' (pg. 73).

Poincare points out that convention is not logically justifiable.  Decisions are not ``synthetic a priori intuitions,'' and are also not ``experimental facts''.  However, conventions are not arbitrary since they have an ``experimental origin'' (pg. 110).  Poincare only allows decisions to be made about the fundamental theories, where ``experiment may serve as a basis for the principles of mechanics, and yet will never invalidate them'' (pg. 106). 

To Poincare, experiments are used as conventions to form the basis of theories, but not to refute or invalidate them.  Conventionalism allows some data to be meaningless or useless to the person collecting the data, but quite useful to someone who holds different conventions.  Failing to share data means imposing one's conventions on the reader, conventions that, if possible, require explicit articulation by the author.

\subsection{Popper's Falsificationism}

Karl Popper's viewpoint was that only refutations, or falsifications, were valuable data, as only they could conclusively disprove theories.  This focus on falsification rather than affirmation was a radical change in philosophy of science \cite{popper2002logic}, as he changed the status of affirmative data from the most valuable (as held by the positivists) to useless.  

To understand Popper's approach, consider the following (universal) statement: if the weight placed on a string is greater than its tensile strength, the string will break.  Now, consider a singular statement: we placed an object on a string that has greater weight than the tensile strength of the string.  Putting these two statements together yields the prediction that the string will break.  In this example, if the string does not break, then the universal statement is disproved.  This type of simple syllogistic reasoning is the foundation of Popper's approach.  It is deductive because the universal statement cannot be verified (inductively) by finding every string and every object that could be put on the string and measuring whether the string breaks when the object is placed on it.  This is because the universal statement applies to all strings and all objects; we cannot search all space and time for each object and string to verify the universal statement.
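
This inference can be written schematically.  The predicate names are assumptions chosen here for illustration: let $W(x)$ be the weight placed on string $x$, $S(x)$ its tensile strength, and $B(x)$ the statement that $x$ breaks.  For a particular string $a$,
\begin{align*}
&\forall x\,\bigl(W(x) > S(x) \rightarrow B(x)\bigr) && \text{(universal statement)}\\
&W(a) > S(a) && \text{(singular statement)}\\
&\therefore\; B(a) && \text{(prediction)}
\end{align*}
If instead $\neg B(a)$ is observed while $W(a) > S(a)$ holds, then \emph{modus tollens} yields $\neg\,\forall x\,\bigl(W(x) > S(x) \rightarrow B(x)\bigr)$.  A single counter-instance refutes the universal statement, whereas no finite number of broken strings can prove it.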

In contrast, singular statements (e.g., the string broke) refer to a specific space and time.  It is simple enough to verify a singular statement: we only need one instance of a string and one instance of an object at any space and at any time.  Popper calls the asymmetry between being able to verify singular but not universal statements \emph{unilateral decidability}.  It is unilateral decidability, where one can verify a singular but not a universal statement, that allows a theory to be falsified, but not verified, by observation.

To make his approach empirical, Popper talks of observations as occurrences.  For example, ``I observe a glass containing water at noon on Saturday in Pittsburgh'' is an occurrence because it is a spatio-temporally limited singular statement.  An event is a class of occurrences that have the same form but differ only in individual names (e.g., observing a glass of water at any time).  Popper thus calls potentially falsifying observations \emph{basic statements}, which are singular statements ``asserting that an observable event is occurring in a certain individual region of space and time'' (pg. 85).

Popper refers to ``inter-subjective'' agreement to determine whether basic statements are true; that is, multiple people see an occurrence and agree that it has occurred.  If there is disagreement, the basic statements can be tested against other basic statements, ad infinitum, in the hope of reaching inter-subjective agreement.  Eventually, testing will resolve to observations where everyone agrees, that are ``easy to test.''  Once an easy-to-test statement is verified, the entire chain of tests that led to it can conclusively resolve in falsification.  If one can't come to this point, then Popper believes either the phenomenon was not inter-subjectively testable, not observable, or that language (communication) has trapped us.  Popper calls inter-subjectively agreeable basic statements that are repeatedly demonstrated and are inconsistent with a theory \emph{reproducible effects}.  Only these reproducible effects can falsify a hypothesis.

In sum, Popper's Falsificationism proposes that for a theory, once sufficiently axiomatized and checked for consistency (i.e., it is not self-contradictory), to be falsifiable and thus scientific, it must rule out not just occurrences, but at least one event.  If this event occurs the theory (universal statement) is falsified.  The only important data are falsifying events, and theories need only be falsifiable to be scientific.

Although Popper's approach appears to provide an unambiguous policy for data sharing (i.e., share only falsifying reproducible effects), it eventually failed.  This is because at any point there is no logical way of disproving what Popper called ``conventionalist stratagems'' that could be used to save a theory against falsification.  The four ``main'' stratagems were: 1) introducing ad hoc auxiliary hypotheses that make the theory consistent with falsifying data, 2) changing the ostensive definitions of the data or the theory to make a falsification into verification, 3) challenging the data directly, as being ``insufficiently supported, unscientific, or not objective, or even on the ground that the experimenter was a liar'' (pg. 60-61), or 4) claiming that the theory was misapplied or misinterpreted by a fallible theoretician.  Popper warned social scientists about these stratagems (pg. 62).  Indeed, they appear to be common reasons for not sharing data, as each conventionalist stratagem can be invoked to argue that the data should not be shared with the scientific community.

Popper's way of avoiding the infinite regress of trying to prove an observation with other observations was to accept a limited form of conventionalism.  According to Popper, one could decide to accept a basic statement as true without further justification.  These are called \emph{accepted} basic statements.  All other statements must be proved and subjected to additional ``inter-subjective tests'' or provisional agreement.  Popper's conventionalism differs from Poincare's in that Popper allows these decisions to only apply to observables (basic statements) rather than theories (universal statements).

\subsection{Lakatos' Research Programmes}

Popper's student, Imre Lakatos, recognized the shortcomings of both Poincare's and Popper's perspectives.  For Lakatos, Poincare's is too conservative: a theory can always be preserved as long as it is convenient to do so.  Popper's is, in contrast, too risky: a falsifying statement could, if accepted based on a fallible convention, rule out a true theory.  Instead, for Lakatos the only tenable position is that ``all propositions in science are fallible'' (pg. 19) \cite{lakatos1980methodology}; that is, ``scientific theories are not only equally unprovable, and equally improbable, but they are also equally undisprovable.''

Lakatos, like Popper, saw conventionalist decisions as a necessary part of a philosophy of science.  He solves Popper's problem by requiring that any conventionalist decision take into account whether the decision leads to the prediction of new facts.  These new facts must be ``improbable or impossible'' without using the newly modified theory.  If the modification does lead to the prediction of such new facts, Lakatos considers the decision \emph{theoretically progressive} and thus admissible.  This is Lakatos' approach: a new theory must explain at least as much as the old one and also predict new facts at a rate that outpaces the adjustment of the theory to fit anomalies.  For Lakatos, to allow refutation without creatively (progressively) proposing an ad-hoc defense ``shows nothing but the poverty of our imagination'' (pg. 35) \cite{lakatos1980methodology}.  For Lakatos, stunning predictions are the valuable data that should be shared.
 
\subsection{Summary}

Kuhn, Poincare, Popper, and Lakatos provide many arguments why data may be invalidated.  Kuhn tells us that we may not see them, they may not make sense outside our paradigm, or we may be convicted as incompetent if we do share them.  Poincare licenses us to discard any data if it is convenient to do so, so as to maintain a tractable theory.  Popper tells us that only falsifiers matter, based on reproducible effects.  Lakatos only cares about the prediction of new facts as important data.  All of these approaches appeal to conventions, rather than logical justifications, to reject data.

\section{Statistical Decisions}

Lakatos and Popper both wanted a formal mathematical approach that could assist scientists in resolving conflicts between data and theory.  In parallel with, but mostly independently of, these philosophical discussions, statisticians provided such rules.  The \emph{Frequentist} approach makes decisions about data that are not statements about the truth value of a hypothesis directly, but instead are justifications for provisional rejection of a statistical hypothesis with known error rates.  The \emph{Bayesian} approach, on the other hand, allows truth values to be assigned to hypotheses directly in terms of subjective probabilities.

\subsection{Frequentist}
\subsubsection{Fisher's null hypothesis significance testing}  

In the early 20th century, around the time when Popper elaborated Falsificationism, R.A. Fisher developed an approach for statistical hypothesis testing to resolve conventionalist decisions about conflicting theory and data.\footnote{While founding modern statistics \cite{radhakrishna1992ra,savage1976rereading} by inventing and refining concepts of sufficiency, consistency, efficiency, maximum likelihood estimation, deriving sampling distributions, and promoting randomization in experimental design, he also was engaged in bitter conflicts with other scientists, including physicists Arthur Eddington and Harold Jeffreys, and statisticians Karl Pearson, and especially Jerzy Neyman \cite{lehmann1993fisher}.  His personal and derisive attacks on the Neyman-Pearson Frequentists, Subjective Bayesians, and Objective Bayesians still cast a shadow over debates between these factions on hypothesis testing and statistical induction.  In light of this, it is not surprising to find that Fisher's views, often intentionally extreme to avoid concessions to others \cite{savage1976rereading,zabell1992ra}, can be easily misinterpreted as justifications for not sharing data.}

Fisher's approach, \emph{null hypothesis significance testing}, was to only accept statements that indicate a well-defined sample of data is inconsistent with a \emph{null hypothesis} (e.g., that two sample proportions are equal).  In this approach, the theoretical frequency distribution of the null hypothesis is constructed and the probability of observing data at least as extreme as the sample, given this distribution, is calculated.  If this probability is sufficiently small (usually less than 5\%), the data are judged to be inconsistent with the null hypothesis, or \emph{statistically significant}, and the null hypothesis is rejected.
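
As a worked illustration, consider a simpler null hypothesis than the two-proportion example above; the numbers are hypothetical.  Suppose a coin is flipped $n = 20$ times, the null hypothesis is that the probability of heads is $0.5$, and $15$ heads are observed.  Under the null hypothesis the number of heads $X$ follows a $\mathrm{Binomial}(20, 0.5)$ distribution, so the probability of a result at least as extreme in either direction is
\begin{equation*}
P(X \ge 15) + P(X \le 5) = 2 \sum_{k=15}^{20} \binom{20}{k} \left(\tfrac{1}{2}\right)^{20} = \frac{2 \times 21700}{1048576} \approx 0.041 .
\end{equation*}
Because $0.041 < 0.05$, the result would be declared statistically significant at the conventional level and the null hypothesis rejected.  With $14$ heads, the same calculation gives approximately $0.115$ and the null hypothesis would not be rejected.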

Fisher believed that null hypothesis significance testing provided exactly what scientists needed to resolve the conflict between theory and data, ``simple rejection of a hypothesis, at an assigned level of significance'' (pg. 40) \cite{fisher1956statistical}.  To him, this is ``all that is needed, and all that is proper, for the consideration of a hypothesis in relation to the body of experimental data available'' (pg. 40).  Like the conventionalist approaches of Popper and Lakatos, this rejection is tentative, where ``no irreversible decision has been taken'' (pg. 38) \cite{fisher1956statistical}.

Data that fail to reject the null hypothesis demonstrate a lack of experimental understanding, as experimental knowledge is gained from knowing ``how to conduct an experiment which will rarely fail to give us a statistically significant result'' (pg. 14) \cite{fisher1935design}.  Aside from rejecting the null hypothesis, there is no other purpose of an experiment, which ``may be said to exist only in order to give the facts a chance of disproving the null hypothesis'' (pg. 16) \cite{fisher1935design}.\footnote{As to uncontrolled causes, or auxiliary hypotheses, Fisher argued for randomization rather than coming up with ``an exhaustive list of such possible differences appropriate to any one kind of experiment, because the uncontrolled causes which may influence the result are always strictly innumerable'' (pg. 18) \cite{fisher1935design}.  Fisher also attempted to derive an objective method of inverse probability, where statistical hypotheses could be given probability distributions.  He called this fiducial probability.  Suppose one has a pivotal quantity (a function of a statistic and parameter whose distribution does not depend on the parameter).  This pivotal quantity can be inverted and a distribution can be solved for the parameter, giving it a fiducial, rather than posterior, distribution.  From this fiducial distribution, the probability of the parameter lying in any region of the distribution can be calculated.  Fisher's fiducial approach to interval estimation is still used, and is equivalent to Neyman's unconditional confidence interval \cite{neyman1937outline,zabell1992ra}, although their interpretations were different.}  Based on this reasoning, Fisher's prescription is to share only results that are statistically significant, and to ignore the rest (pg. 1244 as cited in \cite{lehmann1993fisher}):
\begin{quote}
``it is usual and convenient for experimenters to take $5$ percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard...If $P$ is between $0.1$ and $0.9$, there is certainly no reason to suspect the hypothesis tested.''
\end{quote}

This pervasive idea may be the single strongest convention that determines data sharing since Fisher popularized his approach.  Not sharing data that are not statistically significant may be \emph{the} data sharing problem, one that Fisher invented and advocated.

\subsubsection{Neyman-Pearson Powerful Tests}

Jerzy Neyman and Egon Pearson (son of Karl Pearson) modified the Frequentist foundations Fisher set down \cite{neyman1933problem}.  They agreed with Fisher that it is possible to use mathematics to guide decisions about statistical hypotheses, but instead appealed to a decision-theoretic account of hypothesis testing.  Their approach was concerned with creating a pre-planned analysis that could minimize errors from two types of decisions: 1) concluding that the null hypothesis is false when it is true (Type 1 Error), and 2) concluding that an alternative hypothesis is false when it is true (Type 2 Error).  To do this, the experimenter judges which error is more important, and chooses an error level that the test must not exceed (Type 1 Error or $\alpha$).  Once this is determined, a test statistic is created so as to minimize Type 2 Error ($\beta$).  This test is called the \emph{most powerful test of level alpha}.

To calculate a most powerful test, at least one alternative hypothesis ($\neg H$) must be proposed, otherwise the ``problem of an optimal test of $H$ is indeterminate'' (pg. 104) \cite{neyman1977frequentist}.  In contrast, Fisher was content to only specify a null hypothesis $H$ and the complement $\neg H$ without any specific alternatives.  By calculating the Type 1 and Type 2 error levels, one can get an idea of how often false-positive and false-negative decisions will be made.  The calculation of alpha and beta levels also provides guides for fixing one's experimental design by ``(i) alter[ing] the design of the experiment, (ii) try[ing] to find a more powerful test, (iii) increas[ing] the level of significance and (iv) increas[ing] the number of observations.'' (pg. 107) \cite{neyman1977frequentist}.
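
In symbols, and using notation that is assumed here rather than taken from Neyman and Pearson, let $H_1$ denote a specific alternative to $H$ and let $C$ be the rejection region of the test, the set of samples $x$ for which $H$ is rejected.  The two error probabilities are
\begin{equation*}
\alpha = P(x \in C \mid H), \qquad \beta = P(x \notin C \mid H_1),
\end{equation*}
and the power of the test is $1 - \beta$.  A most powerful test of level $\alpha$ chooses $C$ to maximize $1 - \beta$ subject to $P(x \in C \mid H) \le \alpha$.  When $H$ and $H_1$ are both simple hypotheses with likelihoods $L(H; x)$ and $L(H_1; x)$, the Neyman-Pearson lemma states that, setting aside randomization on the boundary, such a test rejects $H$ when the likelihood ratio is large,
\begin{equation*}
C = \left\{ x : \frac{L(H_1; x)}{L(H; x)} > k \right\},
\end{equation*}
with the constant $k$ chosen so that the test has level $\alpha$.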

Thus, Neyman and Pearson developed a method for maximizing the chance that a result will be statistically significant given a specified set of alternative hypotheses.  In this sense, they constrain the data sharing problem to sharing statistically significant results among studies that have high power.  If this rule were followed, more results would be shared if studies could be planned with high power, or less data would be shared if higher power could not be achieved.

\subsubsection{Mayo's Error Statistics and Severe Tests}

Deborah Mayo \cite{mayo1996error} proposed a philosophy of statistics that generalizes the decision-theoretic approach of Neyman and Pearson.  Her proposal is that if a \emph{severe test} is conducted, which has a low risk of making Type 1 and Type 2 errors, and a theory is not refuted by this severe test, then there is good evidence that the theory is true.  That is, Mayo's severe test is one where a hypothesis would have a high probability of being rejected if it were false in the light of inconsistent data and a low probability of being rejected if it were true.  A hypothesis severely tested in this way is not the same as Popper's ``all theories yet to be refuted'' because the severe tests are custom designed to disprove the hypothesis.

In this approach, what separates science from pseudoscience is the ability to learn from error or failed predictions.  If a failed prediction gives ``rise to a fairly well defined problem; specifically, the problem of how to explain it'' (pg. 33) and can be ``pinned to a specific hypothesis'' (pg. 34), then the approach is scientific.  Possibly failed auxiliary hypotheses, or potential sources of error, are separated and given thorough methodological and statistical examination with the goal of producing reliable experimental knowledge that remains even after the theories that inspired the experiments are proven false.

Mayo explicitly rejects any communication of data that ``prevents the determination of valid error probabilities'' (pg. 297), for example by ``treating pre-designated and post-designated tests alike'' (pg. 296).  Thus, for Mayo, any procedure that invalidates error probabilities, which failing to share data arguably does, is inadmissible.

\subsubsection{Summary}

The Frequentists argue that if data are very unlikely given a hypothesis (i.e., have a low p-value) then this is evidence, but not proof, that the hypothesis is false.  While the decision about the truth of the hypothesis itself depends on broader (non-statistical) scientific judgment, significance tests can provide the best formal guide.  Statistically significant data are seen as meaningful and worth sharing with the scientific community, whereas non-significant data are not.  From Fisher's point of view, failing to get statistically significant results means one does not understand the experiment that was conducted, and one should try again rather than inform others about this failure.  Fisher's mathematical viewpoint provides easy justifications, documented and analyzed by Greenwald \cite{greenwald1975consequences}\footnote{He argues that psychologists see data as synonymous with the hypothesis the researcher wants to test.  Data that fail to reject the null hypothesis are seen as equivalent to data that support the null hypothesis, which the researcher literally believes to be false.  Thus, non-significant data are useless.  He also summarizes three other arguments made by psychologists against reporting non-significant results: 1) that finding non-significant results is not a discovery and does not advance science, therefore it is not worth reporting; 2) that statistical significance is evidence of correct experimental design and hypothesis; and 3) that ``there are too many ways (including incompetence of the researcher), other than the null hypothesis being true, for obtaining a null result.'' He provides refutations for each of these arguments, the strongest being refutation of the third: while it is true that the incompetence of a novice is likely to lead to noisy results and failure to reject the null  hypothesis, incompetence of the novice and expert both can lead to systematic bias that can support one's hypothesis.  
That is, errors due to incompetence cut both ways.}, for sweeping confusing and unexpected results under the rug.    

\subsection{Bayesian}

The alternative approach to making statistical decisions is the Bayesian one (also called inverse probability).  It is more flexible, and more audacious, allowing probabilities to be assigned to explanatory hypotheses directly.  In this paradigm, probabilities are not long-run relative frequencies of events, but instead, ``relative, in part to ignorance, in part to our knowledge'' (pg. 6) \cite{laplace1806marquis}.  That is, probabilities are epistemic states of a person.  

For example, suppose a person flips a fair coin which has known Frequentist properties (e.g., in an infinite sequence of identical flips of this coin the relative frequency of heads to tails is 0.5).  However, the coin flipper views the outcome of the flip but does not tell us.  Clearly the probability he assigns to heads or tails is either 1 or 0, but our probability has not changed.  His probability is not the same as ours.  Thus, ``probability is a function not only of the coin, but also of the information to the person whose probability it is.  Thus subjectivity occurs, even in the single flip of a fair coin, because each person can have different information and beliefs'' (pg. 5) \cite{kadane2011principles}.

There are two main schools of Bayesian statistics: \emph{Subjectivist} and \emph{Objectivist}.  Both schools agree on two fundamental elements of probabilistic beliefs: 1) that they must be coherent, and 2) that they must operate according to Bayes' Rule.  Coherence assures that one cannot be given a series of bets that guarantee a loss of money.  Bayes' Rule guarantees that coherence is maintained in the light of new data (conditioning).  What the two schools of Bayesian statistics disagree on is where prior probabilities (or prior total evidence; \cite{seidenfeld1979not}) should come from; that is, from `objective' or `subjective' sources.

\subsubsection{Subjectivist}
 
Subjectivist Bayesians argue that hypotheses have truth value based on subjective belief.  For them, scientific judgment depends on beliefs that are ``yours alone, and need not be the same as what someone else would say, even someone with the same information as you have, and facing the same decisions'' (pg. 1) \cite{kadane2011principles}.  When deciding whether to believe an experimental result, the Subjectivist Bayesian implicitly or explicitly assigns a subjective probability to hypotheses and data directly.  This subjective probability can be elicited, with varying degrees of success, using proper scoring rules (e.g., \cite{o2006uncertain}).  By assigning prior probabilities and likelihood functions to each hypothesis under consideration, the Subjectivist can deductively derive the posterior probabilities of each hypothesis given new data.  This posterior probability is directly applicable to any hypothesis; that is, the Subjectivist Bayesian can hold a probability of an explanatory theory, which Frequentists and Falsificationists will not do.  Because of this, the Subjectivist Bayesian approach has great advantages in terms of representing sparse or unobservable data, while additionally taking into account beliefs and other factors that play into a decision that a Frequentist would not admit.  

This approach also appears\footnote{The ability to accommodate ad-hoc auxiliary hypotheses is severely limited, either requiring logically omniscient priors or uncomputable functions \cite{danks2008explaining}, or if the true hypothesis is not in the support for a prior, a Bayesian has little hope of discovering the cause, regardless of the amount of data obtained \cite{gelman2010philosophy}.  So, ``thinking Bayesian'' is only provisionally helpful, affording us formal inductive rules only as long as we are willing to pretend that we are logically omniscient.} to have a built-in way of dealing with ad-hoc auxiliary hypotheses, Popper's first conventionalist stratagem.  Each auxiliary hypothesis introduced ad-hoc must have a prior probability assigned to it, which is likely to be small, since it was not considered ex ante (by definition, it wasn't considered ex ante because it is ad-hoc).

To a Subjectivist Bayesian, data are valuable if they change one's posterior beliefs.  If the data produce no change, then they produce no value, as one would not be willing to pay money to reveal the outcome of an experiment that could produce these data.  When sharing data, the Subjectivist Bayesian cares about her beliefs about what other people believe.  If one believes that data will change the posterior beliefs of others, then the data are expected to be valuable to them.  The adoption of personal beliefs about the value of data to oneself and others can be seen as a special type of personal convention that may be justifiable to oneself, but not to others.

\subsubsection{Objectivist}

Since its inception, the Subjectivist approach has been surrounded by controversy.  It was seen as adding unwarranted ``subjectivity'' into science, which was meant to be objective in principle.  As a result it was strongly rejected by many philosophers and scientists, including most of those previously mentioned.  For example, Fisher's problem with using Bayes' Rule for inference was due to ``lack of experimental knowledge'' about the prior \cite{fisher1956statistical}.  If there was some way of establishing the long-run frequency of information in the prior, Fisher would undoubtedly accept that.  Without such proof, Fisher found no use for priors.

To deal with this, the Objectivist Bayesians argue that in many circumstances we should have a \emph{unique} posterior belief that is determined by an objective prior, called the \emph{informationless prior}.  The intent was to provide a method where different observers arrive at the same conclusions (posterior distributions) after observing some data because they all agree that there is one correct prior.  That is, the Objectivist Bayesians tried to come up with a method where prior belief is not purely determined by the opinions of the decision-maker.  

Although the use of informationless priors can be useful, it is contradictory in many circumstances \cite{seidenfeld1979not}.  This includes the three major approaches to informationless priors: 1) Laplace's \cite{laplace1806marquis} \emph{principle of insufficient reason}, 2) the \emph{invariance principle} of Jeffreys \cite{jeffreys1946invariant}, and 3) the \emph{principle of maximum entropy} of Jaynes \cite{jaynes1963information}.  While the ``informationless'' priors they derive end up being quite simple (e.g., a uniform distribution), the justifications for these priors are quite complex and technical.  Each has its advantages and disadvantages, and their derivations are beyond the scope of this paper.  

Just as the Subjectivists hold personal prior beliefs, Objectivist Bayesians make the choice of prior beliefs by appealing to reasonable principles that select informationless priors.  Like all those mentioned before, the choice of `informationless prior' has no unique stance among priors, making it a conventionalist decision.

\section{Ethical Conventionalism}

Each of the philosophical and statistical approaches discussed can be used to argue for data sharing policies that invoke some form of \emph{conventionalism}: Poincar\'e's chooses based on what is easiest to work with, Popper's chooses falsifiers, Lakatos' chooses successful predictions, Fisher's chooses statistically significant results, Neyman and Pearson's chooses statistically significant results with high power, Mayo's chooses severe tests, and Bayesians choose priors and value data that affect the posterior beliefs of others.  None of these approaches resolves the conflict between theory and data without a conventionalist choice.  As a result, any decision made by a researcher to report or omit data involves a convention that is implicitly or explicitly imposed on the reader.  These conventions can be very effective, making the interpretation of the data easy, but can be very misleading if the reader either does not know about the reporting conventions or does not hold the same conventions.  For example, someone holding Fisher's convention of only sharing statistically significant results could mislead a Bayesian, who may find very few data points (even as few as one data point that cannot even be given a p-value in Fisher's convention) very useful.

Thus, any approach that imposes conventions on the reader without making them explicit or allowing the reader to examine the data according to her own conventions is deceptive, and as a result, unethical.  To avoid this possible deception, any ethical data sharing policy must not force conventions on the reader.  Ethical data sharing allows the reader, if he or she desires, to reconstruct the original data without convention, or with minimal convention imposed by the data sharer.  For this reason, even if the meaning of the data is unclear, they still need to be documented.  Even though the documentation is highly theory-laden, driven by the paradigm, the trace provided in the scientific record can be reconstructed by those with differing opinions about appropriate convention. 

This perspective resolves important questions about data sharing, such as whether we should report `warm-up' experiments that preceded our `main' experiments that demonstrate a discovery.  This is because, among the available conventions held in the community, data must be documented and made retrievable (although not necessarily included in the main analysis) if another member of the community would consider that data as evidence.  The ethical data sharing policy depends on the members of the community and their conventions.  If reasonable members of the community hold that conversations with strangers or spouses count as evidence, then these must be properly documented.  Most often, documentation will begin when data are collected after instruments are developed, or during the process of developing measurement instruments.  Thus, as most people in a community would find each experiment in a series of experiments informative, it would violate their conventions to omit these data.  Given this policy, the reader can make a realistic judgment about the validity of the researcher's hypothesis without becoming confused or misdirected by too much, highly uncertain, or weak evidence.  This also makes clear the community's duty to clearly articulate and compile the conventions of their members, as is done in the CONSORT statement \cite{schulz2010consort}.

\section{Conclusion}

I conclude by proposing three simple rules to guide ethical data sharing.  These rules follow from the principle that any data sharing policy, to be ethical, must not impose conventions on the reader:

\begin{itemize}
\item \emph{Rule 1}: Communicate research by your own conventions, making them as explicit as possible.    
\item \emph{Rule 2}: Provide justification for and documentation of these conventions.  That is, for all data omitted that another person in the community could want, specify why these data were not shared.
\item \emph{Rule 3}: Provide a traceable account of other data not extensively detailed so that others can examine them according to their own conventions.
\end{itemize}

When the veil of epistemic and game-theoretic concerns is removed, data sharing is a question of ethics.  It is about honesty.  It is about not fooling ourselves and others.  
 
\chapter{Rational Analyses of Data Sharing}
\section{Introduction}

The work in this chapter extends the work of Overall \cite{overall1969classical}, Ioannidis \cite{ioannidis2005most} and Shafto \cite{shaftoepistemic} by evaluating two normative questions of data sharing: 
\begin{enumerate}
  \item Are disconfirmations less informative than affirmations, and thus less worthy of sharing?
  \item Are disconfirmations more likely to be error than affirmations, and thus less worthy of sharing?
\end{enumerate}

Throughout this Chapter, I use Wason's 2-4-6 rule discovery task\footnote{In the Wason task, a person is given the numbers 2-4-6 and told that they were generated by a hidden rule.  The person can then propose new sets of three numbers and get feedback on whether those numbers also fit the rule.}, the Neyman-Pearson decision-theoretic approach to hypothesis testing, and Bayesian analysis.  

\section{The Differential Diagnosticity Conjecture}
One argument against sharing disconfirming data is the \emph{Differential Diagnosticity Conjecture} ($DDC$): affirmation of a hypothesis is more informative than disconfirmation.  As a result, disconfirming data need not be shared with the scientific community.  The most general version of this conjecture can be formulated in Bayesian terms, to make the notion of `informativeness' precise, in the following way: data that are high probability under our hypothesis change our posterior beliefs more than data that are low probability ($DDC1$).  

\subsection{A Simple Case}
In the simplest case, suppose, as Popper does, a universal statement is our hypothesis: All swans are white ($H$).  The complement to this hypothesis is a singular existential statement: There exists a non-white swan ($\neg H$).  In this case, $H$ assigns probability 1 to all white swans and probability 0 to all non-white swans.  The informativeness of data will depend solely on the prior probability of $H$, $P(H)$.  If $P(H)>0.5$, then disconfirmation is more informative than affirmation because $P(H)-0>1-P(H)$.  The opposite also holds.  Thus, if we are willing to admit a prior probability of our hypothesis, $P(H)$, then $DDC1$ holds or does not hold in an arbitrary manner depending only on our prior beliefs.
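
As a purely illustrative calculation (the priors below are assumed values, not estimates, and reading $1-P(H)$ as the largest shift that affirmation could produce is one interpretation of the inequality above), take $P(H)=0.8$.  Observing a non-white swan forces the posterior to 0, whereas white swans cannot raise the posterior beyond 1:

\[
  \underbrace{\|0-0.8\|_{1}}_{\text{disconfirmation}}=0.8 \;>\; \underbrace{\|1-0.8\|_{1}}_{\text{largest possible affirmation}}=0.2 .
\]

With $P(H)=0.3$ the ordering reverses, since $0.3<0.7$.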

\subsection{Differential Diagnosticity Conjecture 1}
In a more general case one can allow $H$ to assign high probability (but not necessarily 1) to white swans, and low (but not necessarily zero) probability to non-white swans.  Consider the absolute difference between prior, $P(H)$, and posterior, $P(H|D)$, probabilities of some hypothesis $H$ given some new data $D$ as a measure of informativeness.  This difference measure is called $d$ and it has some nice properties which make it preferable to alternatives such as the log-ratio, log-likelihood ratio, and Carnap's \emph{r} \cite{fitelson1999plurality}.

Let the informativeness of new data $D$ under the $d$ measure, written $d(H|D)$, be the $L_{1}$ norm of this difference (for a scalar $X$, $\|X\|_{1}$ is simply the absolute value of $X$):

\begin{equation}
  d(H|D)=\|P(H|D)-P(H)\|_{1}=\|\frac{P(D|H)P(H)}{P(D|H)P(H)+P(D|\neg H)P(\neg H)}-P(H)\|_{1}
\end{equation}

\begin{flushleft}
  To simplify, substitute the symbols $\alpha = P(D|H)$, $\beta = P(D|\neg H)$ and $x = P(H)$ giving:
\end{flushleft}

\begin{equation}
  d(H|D)=\|\frac{\alpha x}{\alpha x + \beta(1-x)}-x\|_{1}
\end{equation} 
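
As an illustrative check of this expression (the numerical values are assumed for the example only), let $x=0.5$.  If $\alpha=0.9$ and $\beta=0.1$, then

\[
  d(H|D)=\left\|\frac{0.9(0.5)}{0.9(0.5)+0.1(0.5)}-0.5\right\|_{1}=\|0.9-0.5\|_{1}=0.4 .
\]

Swapping the likelihoods, so that $\alpha=0.1$ and $\beta=0.9$, gives

\[
  d(H|D)=\left\|\frac{0.1(0.5)}{0.1(0.5)+0.9(0.5)}-0.5\right\|_{1}=\|0.1-0.5\|_{1}=0.4 .
\]

Finally, whenever $\alpha=\beta$ the posterior equals the prior, because $\frac{\alpha x}{\alpha x+\alpha(1-x)}=x$, so $d(H|D)=0$.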

<<fig31,echo=false,fig=false,results=hide>>=
#unset hidden3d
#unset hidden
#unset surface
#unset colorbox
#unset key
#set pm3d
#set style line 100 lt 5 lw 0.5
#set pm3d hidden3d 100
#set view 48,32
#set xlabel "Alpha"
#set ylabel "P(H)"
#set zlabel "d"
#set samples 30; set isosamples 30
#set terminal postscript color solid
#set output "fig31a.eps"
#splot [0:1] [0:1] [0:1] abs(x*y/(x*y+0.1*(1-y))-y) 
#epstopdf fig31a.eps

#unset hidden3d
#unset hidden
#unset surface
#unset colorbox
#unset key
#set pm3d
#set style line 100 lt 5 lw 0.5
#set pm3d hidden3d 100
#set view 48,32
#set xlabel "Alpha"
#set ylabel "P(H)"
#set zlabel ""
#set samples 30; set isosamples 30
#set terminal postscript color solid
#set output "fig31b.eps"
#splot [0:1] [0:1] [0:1] abs(x*y/(x*y+0.5*(1-y))-y) 
#epstopdf fig31b.eps

#unset hidden3d
#unset hidden
#unset surface
#unset key
#set pm3d
#set style line 100 lt 5 lw 0.5
#set pm3d hidden3d 100
#set view 48,32
#set xlabel "Alpha"
#set ylabel "P(H)"
#set zlabel ""
#set samples 30; set isosamples 30
#set terminal postscript color solid
#set output "fig31c.eps"
#splot [0:1] [0:1] [0:1] abs(x*y/(x*y+0.9*(1-y))-y) 
#epstopdf fig31c.eps
@ 

\begin{landscape}
\begin{figure}[ht]
\begin{minipage}[b]{0.3\linewidth}
\centering
\includegraphics[width=\textwidth,angle=-90]{fig31a}
\subcaption{$\beta = 0.1$}
\end{minipage}
\hspace{1cm}
\begin{minipage}[b]{0.3\linewidth}
\centering
\includegraphics[width=\textwidth,angle=-90]{fig31b}
\subcaption{$\beta = 0.5$}
\end{minipage}
\hspace{1cm}
\begin{minipage}[b]{0.3\linewidth}
\centering
\includegraphics[width=\textwidth,angle=-90]{fig31c}
\subcaption{$\beta = 0.9$}
\end{minipage}
\caption[The Informativeness of Data Depending on $\beta$, $\alpha$, and $P(H)$]{Figures showing $d(H|D)$ for varying values of $\beta$.}
\end{figure}
\end{landscape}

Figures 3.1a-c show three graphs of $d(H|D)$ with different values of $\beta = \{0.1, 0.5, 0.9\}$.  It can be seen that one learns more when $P(D|H)=\alpha$ and $P(D|\neg H)=\beta$ are farther away from each other.  That is, one learns more when the probability of the data under our hypothesis is very different from the probability of the data under other hypotheses, regardless of whether the hypotheses themselves are likely or unlikely a priori.  The greater the difference, the more the data suggest one hypothesis over another, and the more we learn.  This pattern does not substantially change depending on the prior probabilities of the hypotheses.  This is intuitive.

What is not intuitive is that the rate of change is greater when $\alpha < \beta$ than when $\alpha > \beta$.  When $\alpha < \beta$, the graph is always downward sloping, meaning less probable data lead to more change in belief than more probable data.  When $\alpha > \beta$, the graph is always upward sloping, meaning more probable data lead to more information than less probable data.  However, the rates are asymmetric: when $\alpha < \beta$ the slopes are much steeper than when $\alpha > \beta$.

What is the meaning of this asymmetry?  This is merely due to the effect of the numerator or denominator on the value of a ratio.  Suppose I have the function $f=\frac{y}{x}$.  If I take $y=x=1$ and then decrement $x$ by 0.9, then $f=10$; however, if I instead increment $y$ by 0.9, then $f=1.9$.  This asymmetry is thus merely due to the choice of numerator or denominator, or, more relevant here, whether I choose $H$ or $\neg H$ for the numerator.

The log-likelihood ratio does not have this sensitivity.  Thus $DDC1$ is true only in an arbitrary sense that I've chosen to put one hypothesis in the numerator over another, or that I've chosen not to use the log-likelihood ratio.
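
A small numerical sketch, with likelihoods assumed only for illustration, shows the symmetry of the log-likelihood ratio.  If $P(D|H)=0.9$ and $P(D|\neg H)=0.1$, then

\[
  \log\frac{P(D|H)}{P(D|\neg H)}=\log 9\approx 2.20, \qquad \log\frac{P(D|\neg H)}{P(D|H)}=\log\frac{1}{9}=-\log 9\approx -2.20 .
\]

Exchanging the roles of $H$ and $\neg H$ changes only the sign, never the magnitude, because $\log\frac{a}{b}=-\log\frac{b}{a}$ for any positive $a$ and $b$.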

If $DDC1$ were properly translated in terms of the general Bayesian analysis described above, it would read as follows, and be correct: ``I learn more when data are very likely under my hypothesis, and very unlikely under alternative hypotheses, than when the data are equally likely under both hypotheses.  Alternatively, I learn more when data are very unlikely under my hypothesis and very likely under alternative hypotheses, than when the data are equally likely under both hypotheses.  This relationship is perfectly symmetric.''

\subsection{Differential Diagnosticity Conjecture 2}  
Now, let's examine a second form of the $DDC$: data that are high probability under a hypothesis, that allows rejection of the complement of this hypothesis, are more informative than data that are low probability, that do not allow rejection of the complement.  Or, more simply, data that reach statistical significance are more informative than data that do not reach statistical significance ($DDC2$).  

One can formalize $DDC2$ with a hybrid Bayesian-Neyman-Pearson approach, which assigns probability to data given a hypothesis based on a rejection region, but also allows assigning probability to hypotheses directly.  Suppose that some person proposes some hypothesis $H$.  Consider two cases.  When receiving affirmation, the data are in the rejection region for $\neg H$ (usually the null hypothesis), meaning the data have low probability under $\neg H$.  Denote this $Data\in RR$ or ``data in the rejection region.''  When receiving disconfirmation, the data are not in the rejection region for $\neg H$, indicating they have high probability for $\neg H$.  Denote this $Data\in \neg RR$.  Also assume both hypotheses have positive probability less than one and have non-zero Lebesgue measure. 

If one substitutes conventional rejection rules, then $P(Data\in RR|\neg H)=\alpha$, where $\alpha$ is the usual significance level ($0.05$) for Type 1 Error, and $P(Data\in \neg RR|H)=\beta$, where $\beta$ is the Type 2 Error, so that $P(Data\in RR|H)=1-\beta$ is the power.  The parallel conditional probabilities and Neyman-Pearson interpretation are summarized in the table below:

\begin{table}[h]
\begin{center}
\begin{tabular}{c c c}
Variables & Conditional Probability & Neyman-Pearson\\ \hline
$1-\beta$ & $P(Data\in RR|H)$ & Power\\
$\alpha$ & $P(Data\in RR|\neg H)$ & Type 1 Error\\
$\beta$ &  $P(Data\in \neg RR|H)$ & Type 2 Error\\
$1-\alpha$ &  $P(Data\in \neg RR|\neg H)$ & True Negative Rate (Specificity)\\
$x$ & $P(H)$ & Probability of Hypothesis\\ \hline
\end{tabular}
\end{center}
\end{table}

To tell whether affirmation is more informative than disconfirmation, we look at the ratio of the $d(H|D)$ measures given affirmation or disconfirming evidence.

\begin{equation}
  \frac{d(H|conf)}{d(H|disc)}=\frac{\|P(H|Data \in RR)-P(H)\|_{1}}{\|P(H|Data \in \neg RR)-P(H)\|_{1}}
\end{equation}

With this setup, if the ratio of the differences in equation 3.3 is greater than 1, then affirmation is more informative than disconfirmation, and if the ratio is less than 1, then the opposite holds.  Thus, substituting the variables for conditional probabilities, the ratio of the differences is as follows:

\begin{equation}
  \frac{d(H|Data \in RR)}{d(H|Data \in \neg RR)}=\frac{\|\frac{1-\beta}{(1-\beta)x+\alpha(1-x)}-1\|_{1}}{\|\frac{\beta}{\beta x + (1-\alpha)(1-x)}-1\|_{1}}=\|\frac{x(\alpha+\beta-1)+1-\alpha}{x(\alpha+\beta-1)-\alpha}\|_{1}
\end{equation}
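
The simplification can be checked step by step.  The following sketch assumes $0<x<1$ and $\alpha+\beta\neq 1$ (if $\alpha+\beta=1$ both differences vanish and the ratio is undefined).  Putting each term over a common denominator gives

\[
  \frac{1-\beta}{(1-\beta)x+\alpha(1-x)}-1=\frac{(1-x)(1-\alpha-\beta)}{(1-\beta)x+\alpha(1-x)},
  \qquad
  \frac{\beta}{\beta x+(1-\alpha)(1-x)}-1=\frac{(1-x)(\alpha+\beta-1)}{\beta x+(1-\alpha)(1-x)} .
\]

The two numerators are equal in absolute value and cancel in the ratio, leaving

\[
  \frac{\beta x+(1-\alpha)(1-x)}{(1-\beta)x+\alpha(1-x)}
  =\left\|\frac{x(\alpha+\beta-1)+1-\alpha}{x(\alpha+\beta-1)-\alpha}\right\|_{1},
\]

since $\beta x+(1-\alpha)(1-x)=x(\alpha+\beta-1)+1-\alpha$ and $(1-\beta)x+\alpha(1-x)=-\left[x(\alpha+\beta-1)-\alpha\right]$.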

From this, affirmation is more informative than disconfirmation whenever:

\begin{equation}
  \|\frac{x(\alpha+\beta-1)+1-\alpha}{x(\alpha+\beta-1)-\alpha}\|_{1}>1
\end{equation}

I eliminate corner solutions where $x \in \{0,1\}$.  The result is that affirmation and disconfirmation are equally informative whenever the following equation holds:

\begin{equation}
  \beta=\frac{2\alpha-2\alpha x + 2x -1}{2x}
\end{equation}

Affirmation is more informative than disconfirmation whenever the following inequality holds:

\begin{equation}
  \beta>\frac{2\alpha-2\alpha x + 2x -1}{2x}
\end{equation}
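
A sketch of the algebra behind this inequality, assuming $0<x<1$: inside the absolute value, the numerator $x(\alpha+\beta-1)+1-\alpha$ equals $\beta x+(1-\alpha)(1-x)=P(Data\in \neg RR)$, and the denominator $x(\alpha+\beta-1)-\alpha$ equals $-\left[(1-\beta)x+\alpha(1-x)\right]=-P(Data\in RR)$.  Both marginal probabilities are positive, so the ratio exceeds 1 exactly when

\[
  \beta x+(1-\alpha)(1-x) > (1-\beta)x+\alpha(1-x)
  \;\leftrightarrow\;
  2\beta x > 2\alpha-2\alpha x+2x-1 ,
\]

and dividing by $2x>0$ gives the bound on $\beta$ stated above.  Read this way, the condition says that affirmation is the more informative outcome whenever landing in the rejection region is, a priori, the less probable outcome.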

The graph of this inequality is shown in Figure 3.2.  From both the equation and the graph, it is clear that (for $\alpha<0.5$) as $P(H)=x$ increases, the threshold rises, so $\beta$ must be higher (power must be lower) for $DDC2$ to hold.  As $\alpha$ decreases, the threshold falls, so $DDC2$ holds at lower values of $\beta$ (higher power).

<<fig32,echo=false,results=hide,fig=false>>=
#unset hidden3d
#unset hidden
#unset surface
#unset key
#set pm3d
#set style line 100 lt 5 lw 0.5
#set pm3d hidden3d 100
#set view 50,220
#set xlabel "P(H)"
#set ylabel "Alpha"
#set zlabel "Beta"
#set samples 30; set isosamples 30
#set terminal postscript color solid
#set output "fig32.eps"
#splot [0:1] [0:1] [0:1] (2*y-2*y*x+2*x-1)/(2*x)
#epstopdf fig32.eps
@ 

\begin{figure}[h]
\centering
  \includegraphics[width=0.8\textwidth,angle=-90]{fig32}
  \caption[Region Where Affirmation is more Informative than Disconfirmation.]{Graph showing when affirmation is more informative than disconfirmation.}
\end{figure}

For most social scientists, $\alpha$ is fixed at 0.05, but power varies.  This special case can be worked out.  The following equation holds:

\begin{equation}
  \beta>0.95-\frac{0.45}{x}
\end{equation}

The graph of this function is shown in Figure 3.3.  As can be seen, $DDC2$ holds under the following conditions: when $P(H)=0.47$, $\beta$ must be greater than zero.  When $P(H)=1$, $\beta$ must be greater than 0.5.  If $P(H)<0.47$, $\beta$ can be any value.  So, if $\beta$ is between 0 and 0.5, affirmation can still be more informative than disconfirmation, provided $P(H)$ is low enough.  For any fixed alpha level, \emph{the higher the value of our prior beliefs, the less likely one is to get more information from affirmation than disconfirmation.}  $DDC2$ generally does not hold when $\beta$ is very low and $P(H)$ is very high.
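
The special case follows by direct substitution of $\alpha=0.05$ into the general bound:

\[
  \frac{2(0.05)-2(0.05)x+2x-1}{2x}=\frac{1.9x-0.9}{2x}=0.95-\frac{0.45}{x} .
\]

Evaluating this threshold at a few priors (chosen only for illustration) gives:

\begin{center}
\begin{tabular}{c c}
$P(H)=x$ & Threshold $0.95-\frac{0.45}{x}$\\ \hline
$0.474$ & $\approx 0$\\
$0.60$ & $0.20$\\
$0.75$ & $0.35$\\
$0.90$ & $0.45$\\
$1.00$ & $0.50$\\ \hline
\end{tabular}
\end{center}

The threshold crosses zero at $x=0.45/0.95\approx 0.474$; for smaller priors it is negative, so every admissible $\beta$ satisfies the inequality.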

<<fig33,echo=false,results=hide,fig=false>>=
#set xlabel "P(H)" font "Times,20"
#set ylabel "Beta" font "Times,20"
#set size 0.8, 0.8
#set terminal postscript color solid
#set output "fig33.eps"
#plot [0:1] [-1:1] 0.95-0.45/x with filledcurve x2 lc rgb "blue",0.95-0.45/x with filledcurve y1 lc rgb "blue",0 lc rgb "black", 0.5 with filledcurve x2 lc rgb "blue"
#epstopdf fig33.eps
#set key outside
@ 

\begin{figure}[h]
\centering
  \includegraphics[width=0.6\textwidth,angle=-90]{fig33}
  \caption[Region Where Affirmation is more Informative than Disconfirmation for $\alpha=0.05$]{Graph of $\beta>0.95-\frac{0.45}{x}$ with shaded region where affirmation is more informative than disconfirmation for $\alpha=0.05$.}
\end{figure}

Let's pick a few values for power to examine this further.  Substituting $\beta = 1 - Power$ for the Type 2 Error:

\begin{equation}
  Power<0.05+\frac{0.45}{x}
\end{equation}

Rearranging:

\[
d(H|conf)>d(H|disc)\leftrightarrow \left\{
\begin{array}{l l}
  x<\frac{0.45}{Power-0.05} & \quad \text{if $Power>0.05$}\\
  x>\frac{0.45}{Power-0.05} & \quad \text{if $Power<0.05$}\\
\end{array} \right\}
\]

If Power is close to 1, then $P(H)$ must be less than 0.5.  If Power is close to 0.5, then $P(H)$ must be less than 1.  If Power is less than 0.5, then $DDC2$ always holds. Since Power and $P(H)$ are usually low, $DDC2$ is likely to hold.  The $DDC2$ has more validity than $DDC1$ in the Neyman-Pearson world we live in.
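
To make these bounds concrete, here are a few illustrative power levels (assumed values, with $\alpha=0.05$ throughout):

\[
\begin{array}{l l}
  Power=0.95: & x<\frac{0.45}{0.90}=0.50\\[4pt]
  Power=0.80: & x<\frac{0.45}{0.75}=0.60\\[4pt]
  Power=0.50: & x<\frac{0.45}{0.45}=1.00\\[4pt]
  Power=0.35: & x<\frac{0.45}{0.30}=1.50
\end{array}
\]

In the last row the bound exceeds 1, so the condition is satisfied by every prior $0<x<1$.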

Why is this different from $DDC1$?  The Neyman-Pearson approach assigns the lower error rate, $\alpha$, to the higher value hypothesis.  In this sense, one cannot just reverse $H$ and $\neg H$ to show that the converse also holds; one cannot switch $H$ and $\neg H$ because we've assigned them different error rates based on their value.

In sum, if we consider $H$ and $\neg H$ symmetric, then the differential diagnosticity conjecture ($DDC1$) has no meaning; \emph{in general, data are more informative if they are assigned very different probabilities by different hypotheses, and this relationship is symmetric}.  On the other hand, if one uses the Neyman-Pearson approach, and requires errors to be smaller for $H$ than $\neg H$, then the differential diagnosticity conjecture ($DDC2$) holds for the cases social scientists usually face.  As a result, $DDC2$ is a logical (although not particularly good) defense of discarding disconfirming data.

\section{Blaming the Method}

An alternative reason for throwing out disconfirming results is that a disconfirmation is more likely to be an error than an affirmation.  I call this the \emph{Blaming the Method Conjecture} ($BMC$).  I again use the Neyman-Pearson approach allowing for prior probabilities or base rates.  

The following table shows the behavioral commitment to make the judgment that $\neg H$ is false when $Data \in RR$ and $\neg H$ is true when $Data \in \neg RR$:

\begin{table}[h]
  \centering
  \begin{tabular}{c c c}
    & $H$ & $\neg H$ \\ \hline
    $Data \in RR$ & A & B \\
    $Data \in \neg RR$ & C & D\\ \hline
    & $P(H)$ & $P(\neg H)$\\ 
\end{tabular}
\end{table}

The Type 1 Error is the probability of the data being in the rejection region given that the statistical hypothesis $H$ is false:

\begin{equation}
  \text{Type 1 Error} = P(Data \in RR|\neg H) = \alpha = \frac{B}{B+D}
\end{equation}

This is not the same as the posterior probability of an error given the data are in the rejection region:\footnote{A quick note on why Tversky and Kahneman said that low power studies increase Type 1 Error \cite{1971belief}.  It can be seen from the equation for $P(error|Data \in RR)$ that, once one rejects the null hypothesis ($Data \in RR$), the expected Type 1 Error is only equal to $\alpha$ if $\alpha P(\neg H)+(1-\beta)P(H)=P(\neg H)$.  The lower the power $(1-\beta)$, the higher the Type 1 Errors among studies that reject the null hypothesis ($Data \in RR$).  Overall \cite{overall1969classical} called this \emph{conditional Type 1 Error}.  It is also interesting to note that conditional Type 1 Error also increases when one is less likely to pick correct hypotheses (i.e., increasing in $P(\neg H)$).}

\begin{equation}
  P(error|Data \in RR) = \frac{B}{A+B}= \frac{\alpha P(\neg H)}{\alpha P(\neg H)+ (1-\beta)P(H)}
\end{equation}
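
A numerical sketch shows how far this posterior can sit from $\alpha$.  The inputs are assumed for illustration only: $\alpha=0.05$ and $P(H)=0.1$.  With power $1-\beta=0.8$,

\[
  P(error|Data \in RR)=\frac{0.05(0.9)}{0.05(0.9)+0.8(0.1)}=\frac{0.045}{0.125}=0.36 ,
\]

and with power $1-\beta=0.2$,

\[
  P(error|Data \in RR)=\frac{0.05(0.9)}{0.05(0.9)+0.2(0.1)}=\frac{0.045}{0.065}\approx 0.69 .
\]

In both cases the posterior probability of error is far above the nominal $\alpha=0.05$, and it rises as power falls.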

Similarly, Type 2 Error is the probability of the data not being in the rejection region given that the statistical hypothesis $H$ is true:

\begin{equation}
  \text{Type 2 Error} = P(Data \in \neg RR| H) = \beta = \frac{C}{A+C}
\end{equation}

This is, again, not equal to the posterior probability of an error given the data are not in the rejection region, which is equal to:

\begin{equation}
  P(error|Data \in \neg RR) = \frac{C}{C+D}= \frac{\beta P(H)}{\beta P(H)+(1-\alpha)P(\neg H)}
\end{equation}
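
Again as an illustration with assumed inputs, $\alpha=0.05$ and $P(H)=0.1$: if $\beta=0.2$,

\[
  P(error|Data \in \neg RR)=\frac{0.2(0.1)}{0.2(0.1)+0.95(0.9)}=\frac{0.02}{0.875}\approx 0.023 ,
\]

and if $\beta=0.8$,

\[
  P(error|Data \in \neg RR)=\frac{0.8(0.1)}{0.8(0.1)+0.95(0.9)}=\frac{0.08}{0.935}\approx 0.086 .
\]

Here the posterior probability of error is well below $\beta$ in both cases, because the hypothesis was unlikely to be true to begin with.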

Thus, this is a confusion of questions and confusion of inverses.  When one says, ``I throw out the data because errors are more likely to occur for disconfirmation than affirmation'' one is saying:

\begin{equation}
  P(error|Data \in \neg RR) > P(error|Data \in RR) \leftrightarrow \frac{B}{A+B} < \frac{C}{C+D}
\end{equation}

This is not equivalent to the Type 1 Error rate being smaller than the Type 2 Error rate:

\begin{equation}
  P(Data \in RR|\neg H)<P(Data \in \neg RR|H) \leftrightarrow \frac{B}{B+D} < \frac{C}{A+C}
\end{equation}

To evaluate the correct question, one must find out when the posterior probability of error is more likely when failing to reject the null hypothesis than when rejecting it:

\begin{equation}
  \frac{P(error|Data \in \neg RR)}{P(error|Data \in RR)}>1
\end{equation}

Rearranging:

\begin{equation}
  \frac{\frac{\beta P(H)}{\beta P(H) + (1-\alpha)P(\neg H)}}{\frac{\alpha P(\neg H)}{\alpha P(\neg H) + (1-\beta)P(H)}}>1 \leftrightarrow \beta (1-\beta) > \frac{\alpha(1-\alpha)(1-P(H))^{2}}{P(H)^{2}}
\end{equation}
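
The rearrangement can be sketched as follows, writing $x=P(H)$ and assuming $0<x<1$ so that all denominators are positive.  Cross-multiplying the two posterior error probabilities gives

\[
  \beta x\left[\alpha(1-x)+(1-\beta)x\right] > \alpha(1-x)\left[\beta x+(1-\alpha)(1-x)\right] .
\]

The term $\alpha\beta x(1-x)$ appears on both sides and cancels, leaving

\[
  \beta(1-\beta)x^{2} > \alpha(1-\alpha)(1-x)^{2} ,
\]

and dividing by $x^{2}>0$ yields the stated condition.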

Using our usual $\alpha = 0.05$, and $P(H)=x$, the conjecture is true if:

\begin{equation}
  0>\frac{0.0475(1-x)^{2}}{x^{2}}-\beta(1-\beta)
\end{equation}

The graph of the right-hand side of this inequality is shown in Figure 3.4:

<<fig34,echo=false,results=hide,fig=false>>=
#unset hidden3d
#unset hidden
#unset surface
#unset key
#set pm3d
#set style line 100 lt 5 lw 0.5
#set pm3d hidden3d 100
#set view 50,220
#set xlabel "P(H)"
#set ylabel "Beta"
#set zlabel ""
#set samples 30; set isosamples 30
#set terminal postscript color solid
#set output "fig34.eps"
#splot [0:1] [0:1] [-0.3:0] 0.0475*(1-x)**2/x**2-y*(1-y)
#epstopdf fig34.eps
@ 

\begin{figure}[h]
\centering
  \includegraphics[width=0.8\textwidth,angle=-90]{fig34}
  \caption[Region Where Disconfirmation is More Likely to be Error than Affirmation]{Graph showing region where disconfirmation is more likely to be error than affirmation.}
\end{figure}

Two facts can be gleaned from the equation and this graph:
\begin{enumerate}
\item The higher the $P(H)$, the more likely the BMC is to hold.
\item The larger $\|\beta - 0.5\|_{1}$ is, the less likely the BMC is to hold.
\end{enumerate}
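
Both facts can also be read off by solving the BMC condition $\beta(1-\beta)>\alpha(1-\alpha)(1-x)^{2}/x^{2}$ for the prior.  As a sketch, assuming $0<x<1$ and $0<\beta<1$, taking square roots gives

\[
  x>\frac{1}{1+\sqrt{\frac{\beta(1-\beta)}{\alpha(1-\alpha)}}} .
\]

With $\alpha=0.05$ and a few illustrative values of $\beta$, the minimum prior needed for the BMC to hold is approximately:

\[
\begin{array}{l l}
  \beta=0.50: & x>0.30\\
  \beta=0.20: & x>0.35\\
  \beta=0.05: & x>0.50\\
  \beta=0.01: & x>0.69
\end{array}
\]

Because $\beta(1-\beta)$ is symmetric around $0.5$, the same bounds apply to $\beta=0.80$, $0.95$, and $0.99$, respectively.  The required prior is lowest at $\beta=0.5$ and rises as $\beta$ moves toward either extreme.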

Thus, the general conclusion is that: 1) if our hypothesis is rarely true, which is usually the case, then the BMC does not hold, 2) if the power is extremely high or low, then BMC is unlikely to hold.  The BMC depends mostly on the prior probability that the hypothesis is true, and less so on the power of the test.  Since we usually deal with circumstances where the prior probability that the hypothesis is true is low, BMC is usually false.  In general, we should expect more errors when we get affirmation than disconfirmation when we are generally poor at choosing true hypotheses.

\section{Conclusion}

In this chapter I've analyzed two justifications for not sharing disconfirming data.

Section 1 evaluates the conjecture that disconfirming data are not as informative as affirming data, and thus need not be shared.  This is called the differential diagnosticity conjecture (DDC).  In the general case, the DDC is false: data are more informative if they are assigned very different probabilities by different hypotheses, and this relationship is symmetric.  However, if one is unwilling to treat hypotheses symmetrically, then affirmation is more informative than disconfirmation.

Section 2 evaluates a different reason for not sharing disconfirming data: that disconfirming data are more likely to be error than affirming data.  Using the Neyman-Pearson hypothesis testing framework, this is shown to be false in the cases social scientists usually face, where the probability of picking a true hypothesis is low.  That is, when we are generally poor at choosing true hypotheses, we should expect more errors from affirmation than from disconfirmation.

\part{Descriptive}
\section*{Introduction to the Descriptive Analysis}

In Part Three of the dissertation, the descriptive analysis examines whether people behave according to standards set forth in the normative analysis.  Chapter Two concluded that, although there is no logical ground for determining whether data or theory is faulty when they conflict, data sharing policies that omit disconfirming data are unethical because they impose conventions on the reader, thus deceiving them.  In the descriptive analysis, Chapter Four complements Chapter Two by examining whether lay participants judge that surprising disconfirmations are not worthy of being published because they are attributed to error, thus privileging theory over data and imposing conventions on readers.

Chapter Three tells us that when the value of hypotheses is symmetric and the error probabilities of false positives and false negatives are equal, affirmation and disconfirmation provide the same information.  In addition, one should expect more errors from affirmation than disconfirmation when one is generally poor at choosing true hypotheses.  Chapter Five complements Chapter Three by evaluating whether lay participants adhere to these normative principles.  Using the Wason 2-4-6 rule discovery task, participants are given a known rate of error in feedback for hypotheses they test, and are given the chance to share the data they collect with another participant who is also trying to solve the rule.  If they are rational, then they should attribute error to feedback whenever they strongly believed their hypothesis a priori and the feedback was disconfirming, or if they strongly disbelieved their hypothesis a priori and the feedback was affirming.

The descriptive research also has a reflexive or `meta' purpose.  The approach and experiments described in Part Three reflect my problems and training along with their natural evolution.  As a result, the research itself is an example of the phenomenon to be described, as the experiments frequently propose hypotheses, fail in their predictions, and then invoke error to explain the failure.  By making this process transparent, not only in the behavior of the subjects of the experiments but also in the experimenter, others can learn from both the method and results.  The research exposes the not-frequently-discussed but very important elements that lead to the file-drawer problem, where `pre-tests' and `pilot-tests' are flexibly defined and reported with the benefit of hindsight and potentially distorted by pressures to publish.  If this process is hidden, it cannot be addressed, discussed, and improved.  

\chapter{Surprises, Error, and Data Sharing}
\section{Introduction}

Every experiment has the potential for unexpected results---otherwise it would not be worth conducting.\footnote{We thank the late Robyn Dawes for reminding us of this principle.}  When surprises arise, scientists need to account for them.  Those results may suggest new theories.  Or, they may just raise questions about the soundness of the experimental design---and the auxiliary hypotheses needed to interpret the data that it produces \cite{lakatos1980methodology}.  In psychology, those questions might include whether research participants understand the instructions and stimuli as intended, whether the set-up conveyed unintended clues or incentives, and whether mistakes were made in data entry or statistical analysis.  The weaker the empirical or theoretical support for these assumptions, the more the interpretation of unexpected results must rely on scientific judgment \cite{fischhoff1999construal}.

Researchers' confidence in that judgment should be shaken by knowing that they already answered these questions as best they could when designing a study.  The need to make such inferences acknowledges that every study requires an assessment of construct validity, as researchers simultaneously evaluate their substantive theories and their methodological assumptions \cite{shadish2002experimental}.  Unexpected results require particularly judicious assessments.  If researchers accept those results uncritically, then they may allow flawed methods to undermine valuable theories.  If researchers challenge the methods hypercritically, then they may unreasonably defend flawed theories.  

The history of physics provides a famous example of making progress by discounting surprising experimental results.  While attempting to measure the charge of an electron, Nobel laureate R.A. Millikan discarded multiple unexpected data points, confidently attributing them to error in his experimental apparatus.  Most of those instances occurred during an ambiguously defined ``warm-up period'' where he ``gradually refined his apparatus and technique in order to make the best measurements.'' \cite[p. 13]{goodstein2000defense}  However, Millikan also rejected later (post-warm-up) observations where ``there were no obvious experimental difficulties that could explain the anomaly.''  He attributed these anomalies to nothing more explicit than ``something wrong with the thermometer.'' \cite[p. 13]{franklin1997millikan}  Later work found that Millikan's intuitions were generally right, even though he did not articulate reasons for them---and, indeed, could not have known the source of the anomalies given scientific knowledge at the time.  (His experimental apparatus was unreliable with charges greater than 30e.)  Had Millikan pursued the anomalies, he would have delayed studies that made important contributions to physics, despite their flaws.  

As in Millikan's case, it may be necessary to ``explain away\ldots odd results'' to avoid having research ``instantly degenerate into a wild-goose chase after imaginary fundamental novelties.'' (Michael Polanyi quoted by \cite[p. 63]{gorman1992simulating}).  Psychological research has identified processes that can support and undermine such judgments.  On the one hand, surprising results can induce a greater subjective need for better explanations, prompting deeper probing and reflection \cite{roese1996counterfactuals}.  On the other hand, such results can prompt ``explaining away'' results that disconfirm favored theories by unfairly attacking auxiliary hypotheses \cite{kunda1990case}.  

In a less happy example from physics, Ren\'e Blondlot's purported discovery of a new type of electromagnetic radiation, called n-rays, ``touched off a wave of self-deception that took years to subside.'' \cite[p. 170]{klotz1980n}  His supporters included respected physicists who uncritically reported expected effects when they placed n-ray sources (e.g., gas burners used for lighting, heated silver or sheet iron) in front of electric spark generators, while accusing scientists who failed to observe those effects \cite{wood1904n} of poor training.  

Both Millikan and Blondlot attributed unexpected results to measurement error.  Such \emph{error model}\footnote{More technically, an error model is a causal explanation that renders the substantive theory conditionally independent of the data when invoked, thus making the data not informative for the theory.} explanations include attributing unexpected data to uncontrolled, unintended, or unknown experimental artifacts. Error models can capture valid intuitions and keep science moving until deeper understanding arrives, as with Millikan.  However, as in the case of Blondlot and his supporters, error models can also immunize hypotheses against valid challenges from disconfirming data \cite{gorman1992simulating,gorman2005scientific,gorman1989error,penner1996trust}.  Psychological processes that could enable error-model thinking include blaming the method \cite{dunbar2001scientific}, biased assimilation \cite{lord1979biased}, confirmation bias \cite{klayman1989hypothesis}, and belief perseverance \cite{nickerson1998confirmation}.    
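
The footnoted definition of an error model can be written compactly.  The symbols $T$ (the substantive theory), $D$ (the observed data), and $E$ (the error model being invoked) are introduced here purely for illustration.  Conditional independence of theory and data given $E$ means

\begin{equation*}
  P(D \mid T, E) = P(D \mid \neg T, E) = P(D \mid E),
\end{equation*}

so that, by Bayes' rule,

\begin{equation*}
  P(T \mid D, E) = \frac{P(D \mid T, E)\,P(T \mid E)}{P(D \mid E)} = P(T \mid E).
\end{equation*}

On this reading, once the error model is accepted, the data leave the probability of the theory unchanged, whether the data affirm or disconfirm it.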

Error models can be created in foresight (for potential surprises) or hindsight (for actual ones).  The voluminous research on hindsight bias \cite{blank2008many,christensen1991hindsight} suggests that the two perspectives will produce rather different error models.  In hindsight, explanations will naturally focus on the observed outcome, whereas foresight will consider possible outcomes \cite{slovic1977psychology}.  Accounts of hindsight bias can be derived from many theories of human memory, judgment, and formal reasoning, including mental model ``rejudgments'' \cite{hawkins1990hindsight}, q-morphisms \cite{holland1989induction}, sense-making \cite{pezzo2003surprise}, causal judgment \cite{roese1997counterfactual}, Bayes nets \cite{koller2009probabilistic}, and causal models \cite{glymour2003learning,griffiths2009theory}.  In general, the psychological evidence on hindsight bias echoes the creeping determinism account originally proposed by Fischhoff \cite{fischhoff1975hindsight}, in which learning about an outcome modifies one's prior beliefs to make it seem more likely.  Generally speaking, the more unexpected the outcome, the stronger this sense-making process and the greater the resultant bias will be ---unless the surprise is so extreme or obviously random that one cannot generate an acceptable causal explanation \cite{pezzo2003surprise}.

Treating expected and unexpected results differently creates the risk of accepting weak, but welcome results uncritically, while learning too little from potentially informative surprises---leading to ``well-intentioned scientists making well-intentioned (although biased) decisions\ldots leading to incorrect results'' \cite[p. 58]{spellman2012introduction}.  The Laser Interferometer Gravitational-Wave Observatory (LIGO) represents one ambitious attempt to reduce that risk.  Members of this ``big science'' project specify a priori rules for removing spurious data prior to statistical analysis, so that these decisions are not unduly affected by the data themselves \cite{christensen2004vetoes,christensen2005veto}.  Those rules seek to balance those scientists' desire to include as many of their (very expensive) observations as possible, while excluding spurious ones that could undo their work.  However, even in that mature science, it is hard to anticipate all possible problems (e.g., a private plane flying into the restricted air space over an interferometer, perturbing the observations), making some post hoc interpretation inevitable. 

Excluding data is straightforward when outright fabrication is discovered \cite{crocker2011addressing}.  It is much harder in the situations usually faced by scientists, where, ``data are not published in good journals, or even in bad journals'' but instead are ``sitting in my file drawer'' \cite[p. 58]{spellman2012introduction}, after being rejected because they ``didn't work [or were] pilot studies'' (p. 58).  

Open-access data advocates argue that all data must be shared, so that the community of scientists can evaluate their relevance directly and discern the ``story of the failures that make the successes possible'' \cite[p. 15]{bradley2007open}.  Some claim that unshared data are ``experimental failures'' \cite[p. 24]{everts2006open}.  Yet, for working researchers to adopt these norms, they need to feel that they are more like Blondlot (potentially mistaken) than Millikan (potential Nobel laureates).  They also need peers who value learning from failures as well as successes.

A priori rules are needed most when the differences are greatest between the error models produced in foresight and hindsight, namely, when the evidence disconfirms researchers' hypotheses, prompting them to generate flexible alternative hypotheses that may overfit the data \cite{kerr1998harking,simmons2011false}.  The present studies examine the role of error models in interpreting and sharing experimental results, focusing on foresight-hindsight differences.   

As a platform for these studies, we use a design introduced by \cite{slovic1977psychology}.  It has participants assess the probability of replicating the initial observation of a hypothetical experiment.  Foresight participants assess that probability for two possible outcomes.  Hindsight participants are told that one of those outcomes was, in fact, observed.  The conditional probability of replication should be the same in both conditions.  However, Slovic and Fischhoff found that hindsight participants see replication as more likely, consistent with being less able to see how the initial study could have turned out otherwise.  We begin by repeating Slovic and Fischhoff's original study, in order to establish a baseline for the following studies, examining how people account for more and less expected results.  Incidentally, we assess the robustness of a widely cited study, thirty-plus years later.

\section{Experiment One}
\subsection{Method}
Participants evaluated the four hypothetical studies presented in Experiment One of \cite{slovic1977psychology}, using their stimulus materials.  These studies tested whether: 1) a virgin rat would exhibit maternal behavior following a blood transfusion from a mother rat, 2) seeding a hurricane with silver-iodide crystals would diminish its wind velocity, 3) goslings would be imprinted on a duck if exposed to its quacking before hatching, and 4) children could take another person's perspective when judging the position of a dot on a large Y.  Foresight participants first assessed the probability of each outcome occurring, then its probability of replication on all, some, or none of 10 additional observations --- should it be observed on a single initial observation.  Hindsight participants were told that one of the two outcomes had occurred, and then assessed its probability of replication.  The design was 4 (study: rat, hurricane, duck, Y-test) by 2 (time: foresight vs. hindsight) by 2 (outcome: A or B) with repeated measures on the first factor and repeated measures on the last factor in the foresight condition, whose participants gave probabilities of replication for both outcomes.  

\subsubsection{Participants}
All 268 participants were paid volunteers who responded to an Amazon Mechanical Turk (MTurk) ad offering them 1 dollar for participation in a 7-minute study.  Mason \emph{et al.} \cite{mason2010financial} found that, when paid more, MTurk participants work longer but do not perform better (in terms of accuracy).  Horton \emph{et al.} \cite{horton2011online} found that MTurk participants replicated results from several classic judgment studies originally conducted with traditional (e.g., student) samples.  A two-part attention filter \cite{downs2010your,oppenheimer2009instructional,paolacci2010running} at the beginning of the experiment assessed whether participants were paying attention.  Only the 173 participants who passed both its parts (one easier, one harder) were included in the analysis.  According to participants' reports, their average age was 32 years old (range = 18 -- 81) and 56.6\% were women.  

\subsection{Results}

Table 4.1 reveals a clear hindsight bias in responses to the first hypothetical study.  Foresight participants said that, if the first virgin rat demonstrated maternal behavior (after receiving a blood transfusion from a mother rat), there was a 27.8\% chance of that happening on all 10 subsequent cases.  Hindsight participants told that the initial case had turned out that way gave a 49.4\% probability to consistent replication.  The corresponding means in \cite{slovic1977psychology} were 30\% and 44\%, respectively.  The mean probability of no replications was 32.8\% in foresight and 16.2\% in hindsight (in the previous study, 29\% and 7\%). The other outcome (B) showed complementary results, also fairly similar to those before. 

The other three hypothetical studies revealed similar patterns (Tables 4.2-4.4):  An initial observation was seen as significantly more likely to be replicated consistently when it was reported to have happened (hindsight) than when it was considered as a possibility (foresight).  It was also judged significantly less likely never to be repeated.  In each case, the means were similar to those in \cite{slovic1977psychology} --- although that is not a necessary condition for replicating the pattern of responses.

\begin{table}[h]
  \caption[Judged Replication Probabilities for Virgin Rat Study]{Judged probability that the initial observation will replicate in all, some, or none of 10 replication trials for virgin rat study.}
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight \\ 
    Outcome & Response & M (SD, N) & M (SD, N)\\ \hline
    \multirow{3}{*}{Outcome A (maternal behavior)}  & All & 28 (29, 61) & 49 (32, 62)\\
    & Some & 39 (29, 61) & 34 (28, 62)\\
    & None & 33 (32, 61) & 16 (20, 62)\\
    \multirow{3}{*}{Outcome B (no maternal behavior)}& All & 47 (37, 61) & 67 (34, 50)\\
    & Some & 34 (29, 61) & 23 (28, 50)\\
    & None & 20 (24, 61) & 10 (20, 50)\\ \hline
\end{tabular}
\end{table}

\begin{table}[h]
  \caption[Judged Replication Probabilities for Hurricane Seeding Study]{Judged probability that the initial observation will replicate in all, some, or none of 10 replication trials for hurricane study.}
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight \\ 
    Outcome & Response & M (SD, N) & M (SD, N)\\ \hline
    \multirow{3}{*}{Outcome A (intensity increases)}  & All & 40 (33, 61) & 52 (30, 62)\\
    & Some & 34 (27, 61) & 34 (26, 62)\\
    & None & 27 (28, 61) & 15 (16, 62)\\
    \multirow{3}{*}{Outcome B (intensity decreases)}& All & 37 (33, 61) & 51 (34, 50)\\
    & Some & 36 (28, 61) & 36 (31, 50)\\
    & None & 27 (28, 61) & 14 (18, 50)\\ \hline
\end{tabular}
\end{table}

\begin{table}[h]
  \caption[Judged Replication Probabilities for Gosling Imprinting Study]{Judged probability that the initial observation will replicate in all, some, or none of 10 replication trials for gosling study.}
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight \\ 
    Outcome & Response & M (SD, N) & M (SD, N)\\ \hline
    \multirow{3}{*}{Outcome A (approaches goose)}  & All & 36 (33, 61) & 56 (34, 62)\\
    & Some & 34 (29, 61) & 34 (19, 62)\\
    & None & 30 (32, 61) & 10 (12, 62)\\
    \multirow{3}{*}{Outcome B (approaches duck)}& All & 49 (34, 61) & 73 (32, 50)\\
    & Some & 34 (30, 61) & 19 (24, 50)\\
    & None & 17 (24, 61) & 8 (16, 50)\\ \hline
\end{tabular}
\end{table}

\begin{table}[h]
  \caption[Judged Replication Probabilities for Y-Test Study]{Judged probability that the initial observation will replicate in all, some, or none of 10 replication trials for the Y-test study.}
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight \\ 
    Outcome & Response & M (SD, N) & M (SD, N)\\ \hline
    \multirow{3}{*}{Outcome A (places dot in area A)}  & All & 45 (34, 61) & 56 (31, 62)\\
    & Some & 37 (31, 61) & 33 (28, 62)\\
    & None & 18 (23, 61) & 10 (12, 62)\\
    \multirow{3}{*}{Outcome B (places dot in area B)}& All & 18 (24, 61) & 31 (28, 50)\\
    & Some & 34 (32, 61) & 46 (31, 50)\\
    & None & 48 (36, 61) & 23 (20, 50)\\ \hline
\end{tabular}
\end{table}

\subsection{Discussion}

Slovic and Fischhoff \cite{slovic1977psychology} found that people see the results of the first observation of a study as more likely to be replicated in hindsight than in foresight.  In this exact replication of their Experiment One, that result held true.  In their Experiment Two, Slovic and Fischhoff \cite{slovic1977psychology} found similar results when foresight participants considered only one of the two possible outcomes, rather than both (as in Experiment One), indicating that their lower confidence in replication was not due to focusing less on each outcome. 

These participants had no natural reason to prefer observing either outcome, unlike actual investigators, who may care deeply about how studies turn out.  However, these participants did have natural expectations, expressed in the probabilities that foresight participants gave for the possible outcomes of the first observation.  As seen in Table 4.5, one outcome was significantly more likely for three of the four hypothetical studies (virgin rat, hurricane, Y-test), whether measured by the mean probability or the percentage of participants assigning a probability greater than 50\%.  

\begin{table}[h]
  \centering
  \caption[Mean Foresight Probabilities for all Studies]{Mean foresight probability and proportion of probabilities greater than 50 for each outcome of the initial observation.  One-sample t-test compares P(A) to 50\%.}
  \begin{tabular}{c c c c c c}
    & P(A) & P(B) & &  & \\
    Study & M (SD) & M (SD) & One-Sample t-test & P(A)$>$50  & P(B)$>$50 \\ \hline
    Virgin Rat & 40 (25) & 59 (26) & \emph{t} (60) = 2.66, \emph{p} = 0.01 & 14/61 & 32/61\\ 
    Hurricane & 52 (27) & 40 (26) & \emph{t} (60) = 2.90, \emph{p} = 0.005 & 29/61 & 12/61\\ 
    Gosling & 51 (28) & 52 (27) & \emph{t} (60) = 0.65, \emph{p} = 0.52 & 28/61 & 25/61\\ 
    Y-test & 65 (28) & 24 (22) & \emph{t} (60) = 4.14, \emph{p} $<$ 0.001 & 39/61 & 5/61\\ \hline
\end{tabular}
\end{table}


From both perspectives, the most likely of the eight outcomes was Outcome A in the Y-test study.  As seen in Tables 4.1-4.4, that outcome also produced the weakest hindsight bias, as though it was so strongly expected that reporting its occurrence had relatively little impact (although it was not so likely as to encounter a ceiling effect).  Conversely, reporting Outcome B in the Y-test study, the least expected of the eight, had a particularly large hindsight effect, indicating willingness to abandon outcome A, given a single contrary observation.  These results are consistent with participants generating causal explanations for explaining whatever they observe, with those more often being error models when they observe the unexpected.  

Experiments Two through Five examine these processes, as revealed in attributions for results of the Y-test study, the one with the most and least expected initial observations.  If participants use error models to accommodate unexpected results, then they should invoke explanations such as ``experimental error'' or ``methodological problems'' more often with Outcome B than with Outcome A.  

\section{Experiment Two}
\subsection{Method}
Experiment Two replicated Experiment One, with two differences:  (a) Participants considered just one hypothetical study, the Y-test, in order to elicit a fuller, more focused set of beliefs.\footnote{Experiment Two originally included all four studies from Experiment One.  However, for the sake of simplicity, we decided to focus on the Y-test results, which had the most and least expected outcomes, and hence best fit our research interests.}  (b) Participants assessed the probability that each of four causes accounted for the results.  Thus, the design was 2 (foresight vs. hindsight) by 2 (expected outcome [area A] vs. unexpected outcome [area B]), between-subjects.  

\subsubsection{Participants}
For Experiment Two, all 664 participants were paid volunteers who responded to an Amazon MTurk ad offering them \$1 for participation in a 7-minute experiment.  Experiment Two used the same attention filter as Experiment One, with 468 individuals (70\%) passing both tests. Their average age was 31 years old (range: 18 -- 63); 50\% were women. 

\subsubsection{Materials}
In Experiment Two, all participants received the same introductory instructions as used in Slovic and Fischhoff \cite{slovic1977psychology}, followed by their description of the Y-Test study.

\begin{quote}
In the pretest of an experiment that she intends to run in the future, an experimenter will place a 4-year-old child in front of an easel with a large Y on it, with a dot in the lower left-hand third of the letter. The child will then be taken around to the back of the easel where he will see another Y. He will be asked to draw a dot in the ``same position'' on that Y as the one he had just seen.
\end{quote}

\begin{figure}[h]
\centering
\caption[Image of Y]{Image of Y shown to participants.}
\includegraphics[width=0.3\textwidth]{yfig}
\end{figure}

\begin{quote}The possible outcomes are (a) the child places a dot in Area A (the lower left-hand third), (b) the child places a dot in Area B (the upper third), or (c) the child places a dot in Area C (the lower-right hand third).
\end{quote}

Participants were then asked for predictions and attributions (using the Area A condition as an example below).  The brackets provide our interpretation of each response. 

\emph{Foresight}.
\begin{quote}
  If the child places a dot in Area A, what is the probability that: (Note: These four probabilities should total 100\%.)
  \begin{enumerate}
  \item The child's understanding of the experimenter's instructions caused the child to place the dot in Area A. [Valid Method.]
  \item Some error in the experiment caused the child to place the dot in Area A. [Invalid Method.]
  \item Random chance caused the child to place the dot in Area A. [Chance]
  \item There was some other cause not already mentioned above. [Other]
  \end{enumerate}
\end{quote}

\emph{Hindsight}.
The instructions for hindsight participants differed in reporting the first observation:
\begin{quote}
  Result: The child placed a dot in Area A (the lower left-hand third).
\end{quote}

\subsection{Results}

Table 4.6 shows median probabilities assigned to the four causal explanations.\footnote{We also asked  exploratory questions not reported here, regarding participants' overall judgments of the strength of the experimental design and how the results should be treated.}   Based on the results of Experiment One (and \cite{slovic1977psychology}), we treat area A as expected and area B as unexpected.  We use medians rather than means because of skewed distributions and outliers.  We conducted median regressions \cite{koenker2009quantreg,wooldridge2009introductory} of the two experimental factors on the probabilities assigned to the four potential causes, using non-parametric bootstrap to estimate standard errors \cite{efron1993introduction}.

\begin{table}[h]
  \caption[Experiment Two Causal Attributions]{Median probability, standard error, and sample size for four causal attributions for Experiment Two. $Md=\text{median}$.  Sample sizes are ($H=\text{hindsight}$; $U=\text{unexpected}$): $HU=115$; $FU=122$; $HE=118$; $FE=114$.}
  \centering
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight\\
    Possible Cause & Condition & Md (SE) & Md (SE)\\ \hline
    \multirow{2}{*}{Valid Method} & Expected (A) & 60 (7.7) & 60 (7.5)\\
    & Unexpected (B) & 30 (5.4) & 50 (3.9)\\
    \multirow{2}{*}{Invalid Method} & Expected (A) & 5 (2.5) & 4.5 (1.5)\\
    & Unexpected (B) & 10 (1.8) & 10 (1.2)\\
    \multirow{2}{*}{Chance} & Expected (A) & 20 (2.1) & 10 (1.6)\\
    & Unexpected (B) & 20 (2.3) & 20 (2.2)\\
    \multirow{2}{*}{Other} & Expected (A) & 8.5 (2.3) & 9.5 (2.3)\\
    & Unexpected (B) & 20 (3.6) & 12.5 (3.9)\\ \hline
\end{tabular}
\end{table}

For each cause, results were generally in the predicted direction, although not always significantly so.  The experimental method was judged more valid when the initial observation was expected (i.e., the child placed the dot in area A) rather than unexpected, in both foresight (60\% vs. 30\%) and hindsight (60\% vs. 50\%).  The corresponding main effect for the difference between the probability assigned to Valid Method in the expected and unexpected conditions (difference = -30; 95\% CI: [-50, -10]) was statistically significant, \emph{t} (464) = 3.00, \emph{p} $<$ 0.05, \emph{d} $=$ 0.14.  There was also a non-significant interaction, with unexpected evidence reducing the probability assigned to Valid Method more in foresight than hindsight (difference = 20; 95\% CI: [-6, 46]), \emph{t} (464) $=$ 1.53, \emph{p} $>$ 0.05, \emph{d} $=$ 0.07.  

Conversely, participants assigned only slightly higher probabilities to Invalid Method after an unexpected observation than after an expected one, in both foresight (10\% vs. 5\%) and hindsight (10\% vs. 4.5\%), a non-significant main effect (difference = 5.0; 95\% CI: [-1.3, 11.3]), \emph{t} (464) $=$ 1.59, \emph{p} $>$ 0.05, \emph{d} = 0.07; nor was there the expected interaction (with larger effects in hindsight), \emph{t} (464) $=$ 0.00, \emph{p} $>$ 0.05, \emph{d} $=$ 0.00. 

Chance was assigned a greater role with unexpected results in hindsight (20\% vs. 10\%), but not in foresight (20\% vs. 20\%), reflected in both a main effect of hindsight (difference = -10; 95\% CI: [-14.9, -5.1]), \emph{t} (464) $=$ 4.08, \emph{p} $<$ 0.05, \emph{d} $=$ 0.19, and an interaction between the two factors, \emph{t} (464) = 2.11, \emph{p} $<$ 0.05, \emph{d} = 0.10.  Finally, Other Causes received higher probabilities with unexpected results, in both foresight (20\% vs. 8.5\%) and hindsight (12.5\% vs. 9.5\%), with a significant main effect (difference = 10; 95\% CI: [0.9, 19.1]), \emph{t} (464) = 2.20, \emph{p} $<$ 0.05, \emph{d} $=$ 0.10, but no interaction.

\subsection{Discussion}

As predicted by the assumption that people rely on error models to explain unexpected outcomes, participants who considered the less expected result (the child placing the dot in area B) assigned significantly lower probabilities to the method being valid and significantly higher probabilities to chance and to other causes.  They did not assign significantly higher probabilities to an invalid method, although there was a trend in this direction.  These patterns were observed in both foresight and hindsight, except for one significant interaction: chance was assigned a greater role for unexpected results in hindsight, but not foresight.  Thus, it appears that people can invoke error model thinking in foresight as well as they can in hindsight---if asked to do so. 

One possible interpretation of these results is that error models are equally available in foresight and hindsight---if people explicitly consider how they will account for an unexpected result.  However, that assessment may not happen as naturally in foresight.  That could be true for actual researchers, if they do not press as hard as they might in foresight to think about possible confounds, as well as for participants in Experiment One, asked to consider, but not explain, an unexpected first observation.  In contrast, the attribution tasks of Experiment Two embody a kind of debiasing procedure, making potentially useful alternative explanations more available in foresight---and the outcomes less likely.  Without having elicited probabilities of replication in Experiment Two, we cannot know.  Experiment Three does that, adding the probability question from Experiment One to the attribution task of Experiment Two.

We predicted that these trends would be stronger with the unexpected observation, insofar as it creates a greater need to explain the result.  The lack of significant interactions with any of the non-chance causes (Valid Method, Invalid Method, Other Causes) suggests that participants found ways to deal with the unexpected result more thoroughly in hindsight.  

Experiment Three looks more closely at participants' use of error models by having them generate a cause for the initial observation on their own, using an open-ended format.  We code this cause into the four categories of Experiment Two.  After providing that cause, participants perform the attribution task of Experiment Two, with a refinement of the Valid Method category: it is phrased more clearly in terms of the hypothesis guiding the investigator (that the child could rotate the image mentally), on the assumption that a valid method is needed for that ability to emerge.  Experiment Three also offers participants an open-ended opportunity to explain their reasoning.  

Finally, we extend the experimental task by asking participants to imagine themselves as the investigators, then say how they would treat the results of a study with responses from an additional 10 children, in terms of whether the data should be published, replicated, or discarded.  If participants believe that an unexpected result is due to error, then they should see it as not worth publishing because it does not properly test the hypothesis---just as Millikan discarded measurements that he thought were contaminated by uncontrolled variation, such as ``something wrong with thermometer.''  

\section{Experiment Three}
Experiment Three changes the methodology of Experiment Two in four ways: (a) Participants generated their own causal explanations before assigning probabilities to pre-defined categories.  (b) We clarified those categories by explicitly offering the non-error explanation (the child rotated the image). (c) We asked participants how they would treat research results in terms of publication, imagining themselves as scientists.  (d) We added a manipulation check.  

\subsection{Method}
Experiment Three followed Experiment Two, with a 2 (foresight vs. hindsight) by 2 (Area: A, B) design.  We added an open-ended attribution task and a question about data sharing, and we revised the structured attribution question.

\subsubsection{Participants}
All 448 participants were paid volunteers who responded to an Amazon MTurk ad offering \$1 for participation in a 7-minute experiment.  Experiment Three used the same attention filter as before, with 359 (80\%) individuals passing. Fifteen (3\%) failed the manipulation check, indicating area C.  Among the remaining 344 participants, the average age was 32 years old (range: 18 -- 68); 151 were women (44\%).   

\subsubsection{Materials}

The instructions followed Experiment Two, with these modifications. 

For Foresight:
After reading the study's design, participants were asked to explain one potential outcome of the first observation.

\begin{quote}
  Please explain why you think the child could place the dot in Area A [or B]. (open-ended) [OpenCause]
\end{quote}

\begin{flushleft}
  Participants then answered the modified version of the attribution question:
\end{flushleft}

\begin{quote}
  What is the probability that? (Note: These four probabilities should total 100\%.)
\begin{enumerate}
\item The child's ability to mentally rotate the image caused the child to place the red dot in Area A. [Rotate]
\item Some error in the experiment caused the child to place the red dot in Area A. [Invalid Method]
\item Random chance caused the child to place the red dot in Area A. [Chance]
\item There was some other cause not otherwise mentioned. [Other]
\end{enumerate}
\end{quote}

Participants then answered the probability-of-replication question from Slovic and Fischhoff (1977) and Experiment One.  Participants next answered a new question asking how they would treat those observations:

\begin{quote}
  If the replication of this experiment with 10 additional children comes out the way you expect, which of the following actions would you recommend that the scientist take:
  \begin{enumerate}
  \item Collect more data before publishing [MoreData]
  \item Publish without collecting more data [Publish]
  \item Do not publish any of the data [NoPublish]
  \end{enumerate}
\end{quote}

An open-ended question asked them to explain this recommendation.  Finally, participants completed the following manipulation check: 

\begin{quote}
  Where did the child put the dot?
  \begin{enumerate}
  \item Area A
  \item Area B
  \item Area C
\end{enumerate}
\end{quote}

For Hindsight participants, the tasks were the same, except that they were told, ``The child placed the dot in Area A [or B].''

\subsection{Results}

Table 4.7 shows judgments of the three replication possibilities.  As in Experiment Two, we used median regression with non-parametric bootstrapped standard errors for statistical tests.
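
One way to write the model behind these tests, assuming dummy coding of the two factors (the notation and coding are illustrative assumptions, not a specification taken from the analysis), is
\begin{equation*}
  \operatorname{Med}(Y_i \mid H_i, U_i) = \beta_0 + \beta_1 H_i + \beta_2 U_i + \beta_3 H_i U_i,
\end{equation*}
where $Y_i$ is participant $i$'s probability judgment, $H_i = 1$ in hindsight (0 in foresight), and $U_i = 1$ for the unexpected result B (0 for A).  Median regression chooses the coefficients to minimize the sum of absolute residuals,
\begin{equation*}
  \sum_{i=1}^{n} \left| Y_i - \beta_0 - \beta_1 H_i - \beta_2 U_i - \beta_3 H_i U_i \right|,
\end{equation*}
so that $\beta_1$ and $\beta_2$ play the role of the hindsight and expectedness effects and $\beta_3$ that of their interaction.  Under this reading, four estimated coefficients leave $n - 4$ residual degrees of freedom, which agrees with the \emph{t} (339) reported for the $71 + 87 + 101 + 84 = 343$ participants listed in the table captions.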

Experiment Three replicates the previously observed hindsight effect, but only for the expected outcome.  Participants told that the first child had placed the dot in the expected place (A) gave higher probabilities to that happening on the next 10 observations than did participants who considered that outcome as a possibility (50\% vs. 30\%).  Consistent replication of the unexpected result (B) was, however, judged equally likely in hindsight and foresight (10\% vs. 10\%).  The corresponding interaction was marginally significant (difference = -20; 95\% CI: [-43, 2.6]), \emph{t} (339) = 1.77, \emph{p} = 0.08, \emph{d} = 0.10.  Conversely, the expected result was judged less likely never to replicate (on the next 10 observations) in hindsight than in foresight (1\% vs. 7\%), whereas the unexpected result was judged more likely never to replicate once it had been observed than when it was just a possibility (25\% vs. 20\%).  Here, too, the interaction was not statistically significant (difference = 10; 95\% CI: [-3, 23]), \emph{t} (339) = 1.57, \emph{p} = 0.12, \emph{d} = 0.08. 

\begin{table}
  \caption[Experiment Three Replication Probabilities]{Median probability (Md) and standard error (SE) for expected replication in Experiment Three. Sample sizes by condition ($F=\text{foresight}$; $H=\text{hindsight}$; $E=\text{expected}$; $U=\text{unexpected}$): $FU=71$; $HU=87$; $HE=101$; $FE=84$.}
  \centering
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight\\
    Expected Replication & Condition & Md (SE) & Md (SE)\\ \hline
    \multirow{2}{*}{All} & Expected (A) & 30 (8.2) & 50 (7.5)\\
    & Unexpected (B) & 10 (2.4) & 10 (2.9)\\
    \multirow{2}{*}{Some} & Expected (A) & 45 (9) & 30 (5.5)\\
    & Unexpected (B) & 50 (5.8) & 50 (5.8)\\
    \multirow{2}{*}{None} & Expected (A) & 6.6 (2.2) & 1 (1.6)\\
    & Unexpected (B) & 20 (4.1) & 25 (4.8)\\ \hline
\end{tabular}
\end{table}

\subsubsection{Data Sharing Judgments}

As seen in the top section of Table 4.8, participants overwhelmingly recommended collecting more data before publishing, for both expected and unexpected results.  Among the minority who recommended publishing, the rate was twice as high with expected results as with unexpected ones, although the difference was not significant.  Conversely, not publishing was more common with unexpected results; again, not significantly so.  These patterns were the same in hindsight and foresight (i.e., with no significant interactions).

\begin{table}
  \caption[Experiment Three Data Sharing Judgments]{Proportion of participants choosing each publishing recommendation, with 95\% intervals.  Each judgment fell into one of three categories, which can be modeled using a Dirichlet distribution.  Intervals were generated from 10000 simulations from a posterior Dirichlet distribution \cite{martin2011mcmcpack} with improper Dirichlet(0,0,0) priors.}  
  \centering
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight\\
    Publishing Recommendation & Condition & Proportion [95\% CI] & Proportion [95\% CI]\\ \hline
    \multirow{2}{*}{More Data} & Expected (A) & 0.79 [0.69, 0.87]  & 0.82 [0.74, 0.89]\\
    & Unexpected (B) & 0.85 [0.75, 0.92]  & 0.83 [0.75, 0.90]\\
    \multirow{2}{*}{Publish} & Expected (A) & 0.18 [0.11, 0.27]  & 0.17 [0.10, 0.25]\\
    & Unexpected (B) & 0.09 [0.03, 0.16]  & 0.09 [0.04, 0.16]\\
    \multirow{2}{*}{No Publish} & Expected (A) & 0.04 [0.01, 0.09]  & 0.01 [0.00, 0.04]\\
    & Unexpected (B) & 0.07 [0.02, 0.14]  & 0.08 [0.03, 0.14]\\ \hline
\end{tabular}
\end{table}
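
The intervals in Table 4.8 can be read through the standard Dirichlet-multinomial update, sketched here under the assumption that the three recommendation counts in a condition, $(n_1, n_2, n_3)$ with $N = n_1 + n_2 + n_3$, follow a multinomial distribution (the symbols are introduced here for illustration).  With a Dirichlet$(\alpha_1, \alpha_2, \alpha_3)$ prior on the category probabilities $(\theta_1, \theta_2, \theta_3)$, the posterior and its mean are
\begin{gather*}
  (\theta_1, \theta_2, \theta_3) \mid n_1, n_2, n_3 \sim \text{Dirichlet}(\alpha_1 + n_1,\, \alpha_2 + n_2,\, \alpha_3 + n_3), \\
  E[\theta_k \mid n_1, n_2, n_3] = \frac{\alpha_k + n_k}{\sum_{j=1}^{3} (\alpha_j + n_j)}.
\end{gather*}
With the improper prior $\alpha_k = 0$, the posterior mean reduces to the observed proportion $n_k / N$, and the posterior is proper only when every category has at least one observation.  Interval endpoints would then be taken as percentiles (for example, the 2.5th and 97.5th) of the simulated draws of each $\theta_k$; that percentile choice is an assumption here.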

\subsubsection{Causal Attributions}

Table 4.9 shows the probabilities assigned to the four explanations of the initial observation.  Rotate attributes the first observation to the child's having (or lacking) the ability to rotate the display mentally (as revealed by a valid method).  As expected, the probabilities assigned to that explanation were higher when results were consistent with that ability (A vs. B), in both foresight (58\% vs. 20\%) and hindsight (67\% vs. 25\%).  The main effect (difference = -40; 95\% CI: [-55, -25]) was statistically significant, \emph{t} (339) = 5.30, \emph{p} $<$ 0.05, \emph{d} = 0.29.  However, that difference was not significantly greater in hindsight (difference = -7; 95\% CI: [-24, 10]), \emph{t} (339) = 0.80, \emph{p} $>$ 0.05, \emph{d} = 0.04.
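
The effect sizes reported with these tests are numerically consistent with $d \approx t / \sqrt{N}$, where $N$ is the total sample size.  This relation is inferred from the reported values rather than stated as the computation used, so it should be read as an assumption.  For the main effect on Rotate, with $N = 343$ (the sum of the cell sizes in the table captions),
\begin{equation*}
  d \approx \frac{5.30}{\sqrt{343}} \approx \frac{5.30}{18.52} \approx 0.29,
\end{equation*}
and for the interaction, $0.80 / 18.52 \approx 0.04$, matching the values given.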

\begin{table}[h]
  \caption[Experiment Three Causal Attributions]{Median probability (Md) and standard error (SE) for four causal attributions in Experiment Three. Sample sizes by condition ($F=\text{foresight}$; $H=\text{hindsight}$; $E=\text{expected}$; $U=\text{unexpected}$): $FU=71$; $HU=87$; $HE=101$; $FE=84$.}
  \centering
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight\\
    Possible Cause & Condition & Md (SE) & Md (SE)\\ \hline
    \multirow{2}{*}{Rotate} & Expected (A) & 58 (6.8) & 67 (5.2)\\
    & Unexpected (B) & 20 (3.5) & 25 (2.8)\\
    \multirow{2}{*}{Invalid Method} & Expected (A) & 5 (1.6) & 1 (1.4)\\
    & Unexpected (B) & 10 (3.1) & 10 (2.6)\\
    \multirow{2}{*}{Chance} & Expected (A) & 20 (2.6) & 10 (2.1)\\
    & Unexpected (B) & 25 (3.2) & 20 (2.0)\\
    \multirow{2}{*}{Other} & Expected (A) & 10 (2.3) & 5 (2.2)\\
    & Unexpected (B) & 20 (3.4) & 20 (2.8)\\ \hline
\end{tabular}
\end{table}

Although there was a trend for participants to assign higher probabilities to Invalid Method after an unexpected observation than after an expected one, in both foresight (10\% vs. 5\%) and hindsight (10\% vs. 1\%), this main effect (difference = 5.0; 95\% CI: [-1.6, 11.6]) was not significant, \emph{t} (339) = 1.54, \emph{p} $>$ 0.05, \emph{d} = 0.08; nor was the interaction corresponding to the weakly greater effect in hindsight, \emph{t} (339) = 0.88, \emph{p} $>$ 0.05, \emph{d} = 0.05.  Chance was invoked less for unexpected results in hindsight (15\%) than in foresight (20\%), a significant main effect (difference = -10; 95\% CI: [-16, -4]), \emph{t} (339) = 3.30, \emph{p} $<$ 0.05, \emph{d} = 0.18.  Other Causes were invoked more with unexpected results, in both foresight (20\% vs. 10\%) and hindsight (20\% vs. 5\%), with a significant main effect (difference = 10; 95\% CI: [1.6, 18.4]), \emph{t} (339) = 2.40, \emph{p} $<$ 0.05, \emph{d} = 0.13, but no interaction.

We coded participants' open-ended explanations into the four categories of the structured attribution questions, adding a category for uninformative responses (e.g., ``the child placed the dot,'' ``I don't know why'').  Table 4.10 shows typical examples.  Other Cause explanations implied a valid method that revealed a different process than Rotate.  

\begin{table}[h]
  \caption[Experiment Three Open-Ended Categories]{Causal categories coded from open-ended responses.}
  \centering
\scalebox{0.9}{
  \begin{tabular}{c c l}
    Category & Subcategory & Examples\\ \hline
    \multirow{6}{*}{Rotate} & \multirow{4}{*}{Spatial Rotation} & The child was unable to mentally rotate the image.\\
    & & The child has bad spatial rotation. \\
    & & The child is able to mentally rotate the image.\\
    & & The child has good spatial rotation.\\
    & \multirow{2}{*}{Perspective-Taking} & The child placed the dot based on his point of view.\\
    & & The child responded to the relative or absolute position.\\ \hline
    \multirow{5}{*}{Invalid Method} & \multirow{2}{*}{Faulty Task} & The instructions were ambiguous.\\
    & & The task was confusing. \\
    & \multirow{3}{*}{Faulty Child} & The child was not paying attention. \\
    & & The child was too young to understand instructions. \\
    & & The child's brain is not developed. \\ \hline
    \multirow{3}{*}{Chance} & & The child placed the dot randomly. \\
    & & The child guessed. \\
    & & The child placed the dot based on luck. \\ \hline
    \multirow{8}{*}{Other Causes} & \multirow{2}{*}{Ambiguous} & That's where the dot was on the other side. \\
    & & The child placed the dot in the same place. \\
    & \multirow{4}{*}{Task-Child Interaction} & The child places the dot based on the shape of the Y.\\
    & & The child is left-handed. \\
    & & The child looks at this area first. \\
    & & The experimenter coached the child on the response. \\
    & \multirow{2}{*}{Memory} & The child remembered where the dot was. \\
    & & The child forgot where the dot was. \\ \hline
    \multirow{3}{*}{Miscellaneous} & & The child placed the dot. \\
    & & A vacuous response. \\
    & & An uninterpretable response. \\ \hline
\end{tabular}
  }
\end{table}

Table 4.11 shows the proportion of participants providing explanations in each category (e.g., 8 of the 84 (10\%) who considered the expected observation in foresight attributed it to the child's mental rotation ability).  For the few participants (11) who gave more than one explanation, we included only the first.  As predicted, Invalid Method explanations were much more likely with unexpected results than with expected ones, in both foresight (28\% vs. 1\%) and hindsight (34\% vs. 0\%), a significant main effect (difference = 31\%; 95\% CI: [24\%, 38\%]), \emph{t} (339) = 9.09, \emph{p} $<$ 0.05, \emph{d} = 0.49. 

\begin{table}[h]
  \caption[Experiment Three Open-Ended Causal Attributions]{Proportion of participants making each attribution in open-ended response, with 95\% intervals.  Intervals were generated from 10000 simulations from a posterior Dirichlet distribution with a Jeffreys prior \cite{gelman2004bayesian,efron2011bayesian}, Dirichlet($1/2$,$1/2$,$1/2$,$1/2$,$1/2$,$1/2$), chosen because one category had zero observations.  Sample sizes by condition ($F=\text{foresight}$; $H=\text{hindsight}$; $E=\text{expected}$; $U=\text{unexpected}$): $FU=71$; $HU=87$; $HE=101$; $FE=84$.}
  \centering
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight\\
    Causal Category & Condition & Proportion [95\% CI] & Proportion [95\% CI] \\ \hline
    \multirow{2}{*}{Rotate} & Expected (A) & 0.10 [0.04, 0.17]  & 0.13 [0.07, 0.20]\\
    & Unexpected (B) & 0.07 [0.03, 0.14] & 0.13 [0.07, 0.20]\\
    \multirow{2}{*}{Invalid Method} & Expected (A) & 0.01 [0.00, 0.05] & 0.00 [0.00, 0.02]\\
    & Unexpected (B) & 0.28 [0.19, 0.39] & 0.34 [0.25, 0.44]\\
    \multirow{2}{*}{Chance} & Expected (A) & 0.04 [0.01, 0.09] & 0.01 [0.00, 0.04]\\
    & Unexpected (B) & 0.06 [0.01, 0.11] & 0.00 [0.00, 0.03]\\
    \multirow{2}{*}{Other} & Expected (A) & 0.81 [0.70, 0.87] & 0.85 [0.76, 0.90]\\
    & Unexpected (B) & 0.46 [0.35, 0.58] & 0.49 [0.38, 0.59]\\
    \multirow{2}{*}{Miscellaneous} & Expected (A) & 0.05 [0.02, 0.11] & 0.01 [0.00, 0.05]\\
    & Unexpected (B) & 0.13 [0.06, 0.22] & 0.03 [0.01, 0.09] \\ \hline
\end{tabular}
\end{table}

\subsection{Discussion}

As in Experiment One and Slovic and Fischhoff \cite{slovic1977psychology}, participants expected the initial result to replicate consistently in 10 additional observations more often in hindsight than in foresight, although that difference emerged here only with the expected observation (A).  Conversely, the probability of never replicating the initial observation was lower in hindsight than in foresight, with a non-significant trend for a larger difference when it was the expected one.  

Few participants recommended publishing the results, even when 10 children responded in the same way as the first, although somewhat more supported publication if the result was expected.  These recommendations were similar in foresight and hindsight.

As in Experiment Two, participants invoked Invalid Method more with unexpected results than with expected ones, both when choosing among fixed options (Table 4.9) and when offering their own (Table 4.11).  Moreover, these attributions were similar in hindsight and foresight, again suggesting that they are available if they are explicitly sought, as required by our attribution and data sharing tasks. 

\section{Experiment Four}
Experiment Four replicates Experiment Three with several refinements designed to provide more sensitive measures.  Based on the open-ended explanations in Experiment Three, Experiment Four divides the Invalid Method category in the structured attribution question into ``something wrong with the child'' and ``something wrong with the task.''  Next, because participants in Experiment Three so uniformly wanted a much larger sample before publication, Experiment Four adds a task asking them to predict the outcomes for 100 additional trials, then indicate whether they would publish that result.  

We also extend our study of error models in two ways.  First, we examine whether people who attribute results to a flawed method also feel that all outcomes are equally likely, by asking participants to predict how many of 100 additional children will place their dot in each of the three areas.  Second, we ask how the researcher should respond, should that pattern actually be observed.

We predicted that an unexpected result (B) would encourage participants to believe that ``anything can happen,'' leading them to predict a more uniform distribution across the three areas and to urge more cautious researcher responses.  As before, unexpected results should be more strongly attributed to methodological problems. 

\subsection{Method}
Experiment Four was a 2 (foresight vs. hindsight) by 2 (Area: A, B) design.  

\subsubsection{Participants}
For Experiment Four, participants were paid volunteers who responded to an Amazon MTurk ad offering \$1 for participation in a 7-minute experiment.  Of 465 individuals, 312 (67\%) passed the attention filter.  Their average age was 31 years old (range: 18 -- 67); 135 were women (43\%).  

\subsubsection{Materials}
The instructions were the same as in Experiment Three, with these modifications: 

\begin{flushleft}
  (a) Before learning (hindsight) or anticipating (foresight) the outcome of the initial observation, participants guessed the researcher's hypothesis:
\end{flushleft}

\begin{quote}
  What do you think the researcher's hypothesis is (give your best guess)? [Hypothesis]
\end{quote}

(b) After considering that initial result, participants answered a modified version of the structured attribution question from Experiment Three:
\begin{quote}
  What is the probability that? (Note: These five probabilities should total 100\%.)
  \begin{enumerate}
  \item The child's ability to mentally rotate the image caused the child to place the dot in Area A. [Rotate]
  \item The child was not paying attention, and this caused the child to place the dot in Area A. [Faulty Child]
  \item The task was confusing, and this caused the child to place the  dot in Area A. [Faulty Task] 
  \item Random chance caused the child to place the  dot in Area A. [Chance]
  \item There was some other cause not otherwise mentioned. [Other]
\end{enumerate}
\end{quote}

(c) Participants then predicted the next 100 observations and assessed their implications: 
\begin{quote}
  In a replication of this experiment with 100 additional children, how many children will place the dot in the following areas:
  \begin{enumerate}
  \item Area A
  \item Area B
  \item Area C
  \end{enumerate}
  If the replication of this experiment with 100 additional children comes out the way you expect, how should the researcher evaluate the hypothesis you guessed?
  \begin{enumerate}
  \item Have less confidence in the hypothesis 
  \item No change
  \item Have more confidence in the hypothesis 
\end{enumerate}

  If the replication of this experiment with 100 additional children comes out the way you expect, which of the following actions would you recommend that the researcher take?
  \begin{enumerate}
  \item Collect more data before publishing [MoreData]
  \item Publish without collecting more data [Publish]
  \item Do not publish any of the data [NoPublish]
\end{enumerate}
\end{quote}

\subsection{Results}
\subsubsection{Causal Attributions}

Table 4.12 shows the probabilities assigned to the five causal explanations of the initial observation.  Participants gave a higher probability to the child's mental rotation ability (as revealed by a valid method) when the dot was placed in area A rather than area B, in both foresight (40\% vs. 10\%) and hindsight (35\% vs. 10\%).  The main effect (difference = -30; 95\% CI: [-45, -15]) was statistically significant, \emph{t} (308) = 4.11, \emph{p} $<$ 0.05, \emph{d} = 0.23, with no interaction.  

Conversely, participants assigned higher probabilities to the two Invalid Method explanations after an unexpected observation (area B) than after an expected one (area A).  For Faulty Child, that was true in both foresight (25\% vs. 10\%) and hindsight (20\% vs. 10\%), with a significant main effect (difference = 15; 95\% CI: [8, 22]), \emph{t} (308) = 4.12, \emph{p} $<$ 0.05, \emph{d} = 0.23, and no interaction.  For Faulty Task, this was also true in both foresight (20\% vs. 10\%) and hindsight (25\% vs. 10\%), again with a significant main effect (difference = 10; 95\% CI: [4, 16]), \emph{t} (308) = 3.14, \emph{p} $<$ 0.05, \emph{d} = 0.18, and no interaction.  Attributions to Chance and Other Causes were unrelated to the reported outcome.  

\begin{table}[h]
  \caption[Experiment Four Causal Attributions]{Median probability (Md) and standard error (SE) for five causal attributions in Experiment Four. Sample sizes by condition ($F=\text{foresight}$; $H=\text{hindsight}$; $E=\text{expected}$; $U=\text{unexpected}$): $FU=79$; $HU=72$; $HE=83$; $FE=78$.}
  \centering
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight\\
    Possible Cause & Condition & Md (SE) & Md (SE)\\ \hline
    \multirow{2}{*}{Rotate} & Expected (A) & 40 (6.3) & 35 (5.5)\\
    & Unexpected (B) & 10 (3.0) & 10 (3.5)\\
    \multirow{2}{*}{Faulty Child} & Expected (A) & 10 (1.8) & 10 (1.9)\\
    & Unexpected (B) & 25 (3.1) & 20 (2.3)\\
    \multirow{2}{*}{Faulty Task} & Expected (A) & 10 (2.9) & 10 (2.0)\\
    & Unexpected (B) & 20 (2.6) & 25 (4.5)\\
    \multirow{2}{*}{Chance} & Expected (A) & 10 (2.8) & 10 (1.3)\\
    & Unexpected (B) & 10 (2.7) & 10 (0.9)\\
    \multirow{2}{*}{Other} & Expected (A) & 6 (2.5) & 5 (2.1)\\
    & Unexpected (B) & 10 (0.8) & 10 (1.0)\\ \hline
\end{tabular}
\end{table}

Thus, an expected result was more often attributed to a theory, whereas an unexpected result was more often attributed to methodological problems, with the child or the task.  There were no significant hindsight-foresight interactions, indicating similar responses to actual results and potential ones.

\subsubsection{Posterior Predictions}

Table 4.13 shows participants' predictions for the 100 additional children.  Both A and B were judged more likely when they were the first child's placement, with the main effect being significant for A (difference = -20; 95\% CI: [-32, -8]), \emph{t} (308) = 3.30, \emph{p} $<$ 0.05, \emph{d} = 0.19, and for B (difference = 10; 95\% CI: [0.6, 19]), \emph{t} (308) = 2.13, \emph{p} $<$ 0.05, \emph{d} = 0.12.  These differences were the same in foresight and hindsight.

\begin{table}[h]
  \caption[Experiment Four Posterior Predictions]{Median number of children expected to place the dot in each area (Md) and standard error (SE) for Experiment Four. Sample sizes by condition ($F=\text{foresight}$; $H=\text{hindsight}$; $E=\text{expected}$; $U=\text{unexpected}$): $FU=79$; $HU=72$; $HE=83$; $FE=78$.}
  \centering
  \begin{tabular}{c c c c}
    & & Foresight & Hindsight\\
    Predicted Placement & Condition & Md (SE) & Md (SE)\\ \hline
    \multirow{2}{*}{Area A} & Expected (A) & 60 (4.7) & 60 (3.6)\\
    & Unexpected (B) & 40 (3.2) & 40 (3.6)\\
    \multirow{2}{*}{Area B} & Expected (A) & 10 (3.3) & 10 (2.4)\\
    & Unexpected (B) & 20 (3.1) & 25 (2.9)\\
    \multirow{2}{*}{Area C} & Expected (A) & 23 (2.2) & 20 (2.4)\\
    & Unexpected (B) & 30 (1.7) & 25 (2.8)\\ \hline
  \end{tabular}
\end{table}

In order to assess participants' tendency to treat the three areas (A,B,C) as equiprobable, we calculate Shannon Entropy ($ShEn$), as a measure of the diffuseness or ``flatness'' of their distribution of predicted dot placements:

\begin{equation}
  \begin{split}
    ShEn(A,B,C) = {} & -P(A)\times\log_2(P(A)) - P(B)\times\log_2(P(B)) \\
    & - P(C)\times\log_2(P(C))
  \end{split}
\end{equation}
Here, $P(A)$ is the proportion of children (out of 100) predicted to place the dot in area A, and so on.  With three response categories, the measure ranges from 0 (all 100 in one category) to 1.585 (a uniform distribution).
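
As a worked illustration of the measure, consider two hypothetical sets of predictions (chosen for the arithmetic, not taken from the data).  A participant predicting that 60, 20, and 20 children will choose areas A, B, and C, respectively, has
\begin{equation*}
  ShEn = -0.6\log_2(0.6) - 2 \times 0.2\log_2(0.2) \approx 0.442 + 0.929 \approx 1.371,
\end{equation*}
whereas a participant predicting 40, 30, and 30 has
\begin{equation*}
  ShEn = -0.4\log_2(0.4) - 2 \times 0.3\log_2(0.3) \approx 0.529 + 1.042 \approx 1.571,
\end{equation*}
close to the maximum of $\log_2(3) \approx 1.585$.  In these calculations, a category with zero predicted children contributes nothing to the sum, following the usual convention that $0 \times \log_2(0) = 0$.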
	
Overall, median Shannon Entropy was higher with an unexpected initial observation (B) than with an expected one (A), in both hindsight (1.44 vs. 1.10) and foresight (1.35 vs. 1.16), a significant main effect (difference = 0.20; 95\% CI: [0.02, 0.38]), \emph{t} (308) = 2.23, \emph{p} $<$ 0.05, \emph{d} = 0.13, and no interaction.  Thus, an unexpected initial result produced a stronger tendency to believe that ``anything can happen'' in the next 100 observations.
 
\subsubsection{Belief Change}

When they predicted the researcher's hypothesis, many participants explicitly indicated an area: A (84), B (12), or C (45).  Among the others, 45 gave answers implying A or C (e.g., ``the same position''; ``the mirror position''). 

Most participants (not shown) thought that if their prediction for the 100 observations came true, then the researcher should be more confident in her original hypothesis, regardless of the first observation that they considered---even though considering that observation significantly affected their predictions for those 100 observations.  This tendency was equally strong in hindsight and foresight.

Participants who predicted flatter distributions (as indicated by higher $ShEn$) were less likely to believe that the researcher should increase her confidence should those results be obtained (\emph{r} = -0.13; 95\% CI: [-0.24, -0.02]), \emph{t} (310) = 2.35, \emph{p} $<$ 0.05, and more likely to believe that she should decrease it (\emph{r} = 0.12; 95\% CI: [0.01, 0.23]), \emph{t} (310) = 2.20, \emph{p} $<$ 0.05.  Thus, participants with less confident predictions (flatter distributions) saw those data as less diagnostic, hence meriting less change in confidence.
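
The test statistics accompanying these correlations are consistent with the standard \emph{t} test for a correlation coefficient, assuming that is the test that was used:
\begin{equation*}
  t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}.
\end{equation*}
With $n = 312$ and $|r| = 0.13$, this gives $t \approx 0.13 \times 17.61 / 0.992 \approx 2.31$ on $310$ degrees of freedom, in line with the reported 2.35 once rounding of \emph{r} to two decimals is taken into account.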

\subsubsection{Data Sharing Judgments}

Most participants (not shown) recommended collecting more data before publishing even if the 100 observations turned out as they had predicted --- although fewer did so than with the 10 additional observations in Experiment Three (82\% (49/358) in Experiment Three versus 65\% (92/312) in Experiment Four).  These judgments were unrelated to the initial result (A or B) or whether it was reported as observed (hindsight) or possible (foresight).  

Those who expected flatter distributions were less likely to recommend publishing without collecting more data (\emph{r} = -0.19; 95\% CI: [-0.30, -0.08]), \emph{t} (310) = 3.47, \emph{p} $<$ 0.05, more likely to recommend collecting more data before publishing (\emph{r} = 0.16; 95\% CI: [0.05, 0.26]), \emph{t} (310) = 2.77, \emph{p} $<$ 0.05, and more likely to recommend not publishing any of the data (\emph{r} = 0.11; 95\% CI: [0.00, 0.22]), \emph{t} (310) = 2.00, \emph{p} = 0.05.  Thus, participants were less inclined to recommend sharing the data with the scientific community when they had more diffuse expectations.

\subsection{Discussion}

When participants considered the child placing the dot in the expected area, they were more likely to attribute that result to a substantive theory, that the child could mentally rotate the image, and less likely to attribute it to methodological problems, such as that the child was not paying attention or found the task confusing.  Participants in Experiments Two and Three attributed the unexpected result to chance or an `other cause', such as placing the dot where the child looks first, more than they did an expected result.  However, in this experiment, where the response format captured their error models, participants attributed unexpected results to error, and not to chance or other substantive theories.

The initial observation also made that result seem more likely in replications of the same experiment.  Although that was true for both the expected and the unexpected initial result, observing the latter made future results seem less predictable, in the sense of being more uniformly spread out across the possible outcomes.  That difference was not just a reflection of making the unexpected outcome seem more likely, thereby leveling the distributions.  Rather, participants who considered B as the initial observation also saw C as significantly more likely, in both foresight (30 vs. 23) and hindsight (25 vs. 20) (difference in medians = 10; 95\% CI: [4, 16]), \emph{t} (310) = 3.57, \emph{p} $<$ 0.05, \emph{d} = 0.20.  Apparently, the unexpected result led to thinking about alternative causal models.  As evidence that the entropy measure captured participants' uncertainty, it was correlated with how confident participants expected the researcher to be and how strongly they would recommend publishing the results.  

We once again observed no foresight-hindsight differences with any of the present measures.  As in Experiments Two and Three, the results suggest that the less certain perspectives of foresight are available in hindsight, if inferential processes are structured to evoke them.  In a debiasing study, \cite{slovic1977psychology} provided such structure by requiring hindsight participants to give reasons why the unreported outcome might have occurred. Here, we asked them to reflect on the causes and disposition of the results.  However, because these inferences were made after participants assessed the probability of replication, we have no direct evidence regarding their effectiveness as a debiasing procedure.

\section{Experiment Five}

In Experiment Four, foresight and hindsight participants were equally likely to invoke error models, both when they generated their own explanations and when they chose among explanations that we offered.  The similarity of those attributions, with and without outcome knowledge, suggests that people could generate error models at any time.  However, the intuition motivating our studies is that they typically do not do so until, in hindsight, unexpected outcomes motivate them to think even harder about what might have gone wrong.

If those additional reasons came from their own minds, rather than being generated in response to unexpected evidence, then they should have been incorporated in their prior knowledge.  Considering these error models before data collection would thus have allowed orderly, even Bayesian, updating.  However, having those considerations arise from unexpected observations means that such updating may be biased by the very results that prompted it.  For example, hindsight bias should make reasons consistent with the results disproportionately accessible.  Confirmation bias should give those reasons disproportionate credibility.  If so, then researchers' inferences will be biased toward supporting their initial hypotheses by virtue of undermining the credibility of unexpected (and perhaps unwanted) results.  Conversely, expected results will not prompt such a search for additional error model reasons.

Experiment Five creates conditions closer to actual foresight.  In the \emph{complete prior} condition, participants assess the potential relevance of three possible error models before they observe the data.  These three explanations vary in how strongly they favor areas A, B, and C.  In each of three incomplete prior conditions, one of these three explanations is omitted, so that it could be generated after observing the outcome of the experiment.  We expect participants to overweight the credibility of explanations that are initially omitted, then discovered when needed.

\subsection{Method}
Experiment Five was a 2 by 4 design, crossing two possible outcomes (A, B) with four sets of explanations given to participants before they considered an outcome.  

\subsubsection{Participants}
For Experiment Five, participants were paid volunteers who responded to an Amazon MTurk ad offering them 1 dollar for participation in a 7-minute experiment.  Applying the same attention filter left 969 of the 1628 respondents (60\%).  Their average age was 30 years (range: 18--70); 408 were women (42\%).  

\subsubsection{Materials}
The instructions were the same as in Experiment Four, except that before considering the initial result, participants answered a modified version of the structured attribution question from Experiment Four.  In the complete prior condition, the question was:

\begin{quote}
  Which of the following do you think could possibly affect the experimental results (check all that apply)?
  \begin{enumerate}
  \item  The task is confusing. [Uniform]
  \item The children selected for the study are left-handed. [Non-Uniform area A]
  \item Children like putting things in the middle, to maintain symmetry.  [Non-Uniform area B]
  \item Some other cause. [Other]
\end{enumerate}
\end{quote}

In each of the three incomplete prior conditions, one of the three alternative explanations (Non-Uniform area A; Non-Uniform area B; Uniform) was omitted.  These explanations were meant to be available to explain outcome A, outcome B, or all three outcomes, respectively.

Participants were then told the first child placed the dot in either area A or area B and were asked to attribute the cause:
\begin{quote}
  What is the probability that? (Note: These five probabilities should total 100\%.)
  \begin{enumerate}
  \item The child's ability to mentally rotate the image caused the child to place the dot in Area A. [Rotate]
  \item The task was confusing, and this caused the child to place the  dot in Area A. [Uniform] 
  \item The child was left-handed, and this caused the child to place the dot in Area A. [Non-Uniform area A]
  \item The child likes putting things in the middle to maintain symmetry, and this caused the child to place the dot in Area A.  [Non-Uniform area B]
  \item Random chance caused the child to place the dot in Area A. [Chance]
  \item There was some other cause not already mentioned. [Other]
  \end{enumerate}
\end{quote}

Participants then predicted the next 100 observations and made data sharing judgments, as in Experiment Four. 

\subsection{Results}

\subsubsection{Causal Attributions}

Before considering any observations, 51\% of participants thought that the task being confusing could affect the results, 37\% that children being left-handed could do so, and 41\% that children's preference for symmetry could. Thus all three of these explanations had plausible effects.  

The probability assigned to the child having the mental ability to rotate the image (Rotate) was unaffected by which of the possible causes were mentioned before the result was observed.  That probability was significantly higher when the child placed the dot in area A (25; 95\% CI [22, 28]) rather than B (10; 95\% CI [8, 12]), \emph{t} (967) = 9.45, \emph{p} $<$ 0.05, \emph{d} = 0.30, consistent with that ability explaining the former result, but not the latter. 

Participants assigned the same probability to the child's confusion (Uniform) affecting the results regardless of whether that explanation was mentioned before the initial observation was reported.  That probability was higher with the unexpected observation (B) than with the expected one (A), (20; 95\% CI [18, 22]) vs. (10; 95\% CI [8, 12]), \emph{t} (967) = 7.36, \emph{p} $<$ 0.05, \emph{d} = 0.24.  

The probability assigned to the child being left-handed (Non-Uniform area A) was affected by which of the possible causes were mentioned before the result was observed.  Using means rather than medians (there was little re-sampling variance of the median in the bootstrap), participants assigned higher probabilities to the child's left-handedness affecting the initial observation when that explanation was mentioned before the observation was reported.  This occurred when they were told the child placed the dot in area A (13; 95\% CI [10.7, 15.8]) vs. (9; 95\% CI [7.1, 10.8]), \emph{t} (277) = 2.65, \emph{p} = 0.009, \emph{d} = 0.16, and, at marginal significance, when they were told the child placed the dot in area B (7.9; 95\% CI [5.7, 10.1]) vs. (5.6; 95\% CI [4.3, 7.0]), \emph{t} (207) = 1.78, \emph{p} = 0.077, \emph{d} = 0.12.  Thus, an explanation of an expected result was judged less plausible when it was not mentioned before observing the results, regardless of whether the results confirmed or disconfirmed those expectations.  There was also a significant main effect of area, such that the median probability assigned to the left-handed explanation was higher for participants told that the child placed the dot in area A (10; 95\% CI [6, 10]) rather than area B (5; 95\% CI [5, 5]), \emph{t} (967) = 5.69, \emph{p} $<$ 0.05, \emph{d} = 0.18. 

Next, participants told that the child placed the dot in area B assigned higher mean probabilities to the symmetry explanation (Non-Uniform area B) when it was mentioned before the result was observed (24.4; 95\% CI [20.2, 28.7]) compared to when it was not (17.9; 95\% CI [14.3, 21.5]), \emph{t} (201) = 2.34, \emph{p} = 0.02, \emph{d} = 0.16.  This relationship did not hold for participants told that the child placed the dot in area A (8.3; 95\% CI [6.1, 10.5]) vs. (7.2; 95\% CI [4.8, 9.5]), \emph{t} (280) = 0.74, \emph{p} = 0.46, \emph{d} = 0.04.  There was also a significant main effect, such that the median probability assigned to symmetry as a cause was much lower for participants told that the child placed the dot in area A (1.5; 95\% CI [0, 5]) compared to participants told area B (20; 95\% CI [15, 20]), \emph{t} (967) = 9.79, \emph{p} $<$ 0.05, \emph{d} = 0.31.  Thus, as with the expected result, explanations for the unexpected result seemed less likely when not mentioned before the result was observed.  Unlike the explanation for the expected result, this only happened when the result was consistent with the explanation.  No differences emerged for chance and other causes for outcome or prior condition.

Thus, both non-uniform explanations were judged less likely in hindsight when they were not mentioned in foresight.  Attributions to the uniform (anything goes) explanation were unaffected by mentioning it in foresight.

\subsubsection{Posterior Predictions}

Participants predicted more of the next 100 observations in area A when told that the initial observation was there (med = 60; 95\% CI [60, 65]) than when told area B (35; 95\% CI [33, 40]), \emph{t} (967) = 10.8, \emph{p} $<$ 0.05, \emph{d} = 0.35.  Likewise, they predicted more observations in area B when told that the initial observation was there (med = 30; 95\% CI [30, 33]) than when it was not (10; 95\% CI [10, 15]), \emph{t} (967) = 9.91, \emph{p} $<$ 0.05, \emph{d} = 0.32.

As in Experiment Four, area C was predicted more often by participants told that the initial observation was B (med = 25; 95\% CI [25, 27]) rather than A (20; 95\% CI [20, 20]), \emph{t} (967) = 6.55, \emph{p} $<$ 0.05, \emph{d} = 0.21, even though area C was not mentioned.

As measured by $ShEn$, the distributions of these predictions were flatter after the unexpected initial observation (B: 1.44; 95\% CI [1.37, 1.49]) than after the expected one (A: 1.18; 95\% CI [1.16, 1.22]), \emph{t} (967) = 7.48, \emph{p} $<$ 0.05, \emph{d} = 0.24.  There were no main effects or interactions of mentioning explanations.
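
For reference, the following is a sketch of how such a flatness measure behaves.  It assumes that $ShEn$ denotes the Shannon entropy of a participant's predicted proportions over the three areas, computed with base-2 logarithms.  The base is an assumption made for this illustration; the reported values exceed $\ln 3 \approx 1.10$, the maximum attainable over three areas with natural logarithms, so they are at least consistent with it.  With $p_k$ the proportion of the next 100 observations predicted for area $k$ (and taking $0 \log_2 0 = 0$),
\begin{equation*}
ShEn = -\sum_{k \in \{A, B, C\}} p_k \log_2 p_k .
\end{equation*}
Under this assumption, a hypothetical prediction of 60, 20, and 20 observations in areas A, B, and C gives
\begin{equation*}
-0.6 \log_2 0.6 - 2\,(0.2 \log_2 0.2) \approx 0.44 + 0.93 = 1.37 ,
\end{equation*}
whereas a uniform prediction gives the maximum, $\log_2 3 \approx 1.58$.  Higher values thus indicate flatter predictions.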

\subsubsection{Data Sharing Judgments}
	
Most participants again recommended collecting more data before publishing, even if the 100 observations turned out as they had predicted.  These patterns were unrelated to the initial observation and to which error models were mentioned beforehand.  As in Experiment Four, those who expected flatter distributions (with higher $ShEn$) were less likely to recommend publishing (\emph{r} = $-0.12$; 95\% CI: [$-0.19$, $-0.06$]), \emph{t} (903) = 3.7, \emph{p} $<$ 0.05, but not more likely to recommend not publishing any of the data (\emph{r} = $-0.003$; 95\% CI: [$-0.08$, 0.07]), \emph{t} (733) = 0.09, \emph{p} $>$ 0.05.  There were no other main effects or interactions between outcome reported, explanation mentioned, and publish judgments.  Thus, participants were more likely to recommend collecting more data when they had diffuse expectations for the outcome of exact replications of the same experiments.

\subsection{Discussion}

In Experiment Five, the two non-uniform explanations (the child was left-handed, the child prefers symmetry) were both assigned higher probabilities of causing the result when mentioned before participants learned the outcome of the initial observations.  That effect was greatest when the reported observation was consistent with the explanation, suggesting that causal models can be overlooked unless prompted, whether by a question or by an observation.  Attributions to a diffuse explanation, producing uniform expectations (the child was confused), did not increase when it was mentioned sooner, rather than later, suggesting that such error models are always available.  Publication judgments and predictions were unrelated to which explanations were mentioned (and omitted).
	
As in Experiment Four, participants found area C more likely when the initial observation was unexpected (B).  This again suggests that the surprise made that seemingly unrelated outcome more plausible, consistent with ``surprise'' participants making more diffuse predictions overall, and being less likely to recommend publishing, even when the data confirmed their predictions.  

\section{General Discussion}

We present five experiments examining how the evaluation of scientific evidence differs when the results are expected or unexpected and when considered in foresight or hindsight.  Experiment One repeats Experiment One of Slovic and Fischhoff \cite{slovic1977psychology}, thirty-five years later with an online (MTurk) sample, and finds similar results: an initial observation seems more likely to be replicated when considered in hindsight compared to foresight.  Subsequent experiments examined responses to the most expected and unexpected results, among the four studies evaluated in Experiment One.  

In Experiment Two, contrary to our expectations, participants were equally likely to attribute expected and unexpected results to methodological error, in both foresight and hindsight.  Experiments Three and Four addressed a possible methodological problem with Experiment Two.  Appropriate to our topic, it was a measurement error in how we elicited attributions to such problems.  Experiment Three elicited explanations with an open-ended format, allowing participants to use their own concepts and terms.  Experiments Four and Five offered fixed alternatives based on these responses.  In all three experiments, unexpected results were more often attributed to flawed methods (or ``error models''), but not to chance or other causes, compared to expected ones.  These effects were similar in foresight and hindsight, suggesting that explicitly asking about alternative explanations equates these perspectives.  Experiment Five affirmed this observation by systematically varying which explanations were mentioned before any results were reported.  Mentioning explanations consistent with non-uniform predictions increased attributions to them, especially when consistent with the outcome.  Invocation of the uniform error model explanation, making no specific predictions, was unaffected by whether it was mentioned before or after the initial observation.

Experiments Four and Five also found that reporting an unexpected outcome led to flatter predictions for 10 or 100 additional observations, also consistent with surprises evoking error models.  In these predictions, observing the unexpected outcome (B) also increased the probability of the unrelated outcome (C), as though anything was now possible.  We found that participants who gave flatter predictions were also less willing to recommend publishing the results rather than collecting more data.

Many previous studies have found that unexpected results are more likely to be attributed to error \cite{gilovich1983biased,lord1979biased,mahoney1977publication,munro1997biased,ross1975perseverance,wyer1983effects}.  However, in these studies, the expected outcomes were typically also desired ones, for example, that capital punishment is an effective (or ineffective) crime deterrent \cite{lord1979biased}.  Thus, participants attributed outcomes that were unwanted as well as unexpected to error.  One exception is the finding that people often invoke error when confronted with disconfirming feedback in the Wason rule discovery task \cite{gorman1986possibility,penner1996trust}, although even there, participants may become invested in a favored hypothesis.  Here, participants considered studies run by others on a neutral topic, hence had expectations without desires.  Masnick \emph{et al.} \cite{masnick2009evaluating} found a similar result using short vignettes about the efficacy of pedagogical techniques.  In that study, participants who had natural expectations about the efficacy of the techniques, but no investment in their success, attributed unexpected results to methodological flaws. 

The experiments can provide guidance to practicing researchers.  Researchers worry about `overfitting' unexpected data that were not considered in foresight \cite{kerr1998harking}.  There is a real risk of becoming committed to a weak and unwarranted theory that just happens to fit the data well.  In Experiments Two through Four, explanations of data were seen as equally probable in foresight and hindsight.  This suggests that one need not worry so much about whether these explanations will seem disproportionately likely.  Instead, our results indicate that, as long as the set of explanations remains fixed, judgments of them will not be affected. 

However, failing to consider an explanation at all is another matter.  Experiment Five found that explanations that predicted specific results (e.g., area A or area B), as opposed to uniform explanations, are judged more probable when considered before observing the results, especially when they are consistent with the results.  Thus, researchers may be overly skeptical of theories that were generated after the data are observed, or conversely, not skeptical enough of explanations set forth ahead of time.  More ``natural'' explanations, that come to mind easily when designing an experiment, are also seen as relatively more likely after observing the results compared to explanations that may have required more thought (and even empirical observation) to generate.  Deeper probing in foresight may help, by making sure all explanations that are serious possibilities are considered before observing the results.  Unfortunately, there is no limit to this time-consuming and often frustrating process, so its termination ultimately depends on a judgment that the explanations considered so far are `good enough'.

In both Experiments Four and Five, diffuse data were seen as due to a flawed experiment. As a consequence, researchers should reflect on the temptation to lock diffuse data away in their file drawers \cite{rosenthal1979file}.  Researchers should be careful to examine data that may initially seem diffuse and uninformative, as it may be possible to discover systematic sources of error in the noise. As long as there are clues to the noise in the data, they can be very helpful for planning new experiments.  Such data can lay the foundation for more sensible future experiments, as Experiments Two and Three did in our case, where an initial failure to find that surprises were attributed to error in Experiment Two laid the foundation for discovering the categories of error that needed to be included in Experiment Three. 

One limit to the present research is its reliance on structured attribution options. The open-ended responses in Experiment Three revealed some of the diversity in how people intuitively formulate error model explanations. The structured options based on these responses revealed patterns missed by the ones that we produced intuitively in Experiment Two. Nonetheless, there is more to be learned by eliciting participants' normal ways of thinking. A second limit is reliance on a single experimental stimulus, the Y-test study. As revealed in the manipulation check of Experiment Three, some participants interpreted it differently than we had expected. Any confusion on their part might have limited their ability to generate alternative explanations (and our ability to understand what they produced). 

The results point to several directions for future research.  In our experiments we do not look at the number and type of mentioned and omitted explanations.  We often consider only one explanation before observing our results.  When an explanation is the only one considered beforehand, alternative explanations generated after observing the data may gain little credibility, compared to if we had considered multiple explanations beforehand.  Experiment Five did not test this, because in all conditions participants considered at least three explanations beforehand.  Similarly, all of our experiments include all explanations at the time of measurement, possibly affecting the probability assigned to them compared to if they were left in an `all else' category.  Explanations neither mentioned beforehand nor included at the time of measurement may be disproportionately ignored, similar to omission of possible failure modes in fault trees \cite{fischhoff1978fault}.

In a dynamic context, one could look at how participants create error models and data veto judgments in a flexibly defined ``warm-up'' period, similar to that attributed to Millikan. For example, one could examine how people decide whether to continue pursuing research goals in the face of apparent anomalies. This pits Millikan's warm-up period, where one throws away anomalous data to maintain the research goal, against Polanyi's (and Mayo's \cite{mayo1996error}) wild goose chase, where one pursues anomalies as they arise before continuing the research project. These are two very different and important research strategies. Their relative merits may be more easily decidable on psychological rather than normative (philosophical) grounds. 

Other interesting future directions involve possible field experiments on the generation of causal explanations and data veto policies for unexpected results before or after scientific experiments are conducted. Disciplines amenable to this would include physics, such as work done at the Large Hadron Collider or LIGO, pharmaceutical drug discovery, and psychological research. These contexts are likely to elicit much richer causal reasoning. The meaning of publishing the data in these contexts is well understood by those involved, so a manipulation such as specifying an ex ante versus an ex post data veto policy, simulating what was done by LIGO, would also be very informative. 

\section{Conclusions}

Kuhn \cite{kuhn1996structure} asked, ``How do scientists proceed when aware only that something has gone fundamentally wrong at a level with which their training has not equipped them to deal?'' (p. 86).  He answered, in effect, that they naturally attribute unexpected results to flawed experimental method and expected ones to the theory that guides them.  It takes an accumulation of unexpected results, along with a deep insight, to prompt a scientific revolution.  Here, we found similar treatment of expected and unexpected results with lay participants evaluating a single study.  The practical implications of these results are seen in the data sharing policies that participants revealed.  Although participants were generally cautious about publishing any results, they had much more confidence (less diffuse predictions) in ones that confirmed their expectations.  Thus, by implication, unless scientists follow pre-specified data veto rules, they risk disproportionately discarding unexpected results. 

The task used here was taken from a study of hindsight bias, wherein people struggle to retrieve the uncertainty of foresight, increasing their confidence that an observed result will be replicated.  Producing reasons why another result was possible reduces that bias, by enriching hindsight.  Conversely, considering a fuller set of possible causal principles prior to observing any results can reduce the tendency to invoke error models to explain unexpected results, by enriching foresight.   

\chapter{Incentives, Error, and Data Sharing}

<<preamble,echo=false,results=hide,fig=false>>=
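# Load previously saved R objects from the file "was1"
load("was1")
#install.packages("arm")
library(arm)
# Print numbers with about two significant digits
options(scipen=0,digits=2)
n1<-3
n2<-3
n3<-3
# Floor values at 0.001 (anything below 0.001 is replaced by 0.001)
ptrunc<-function(x){ifelse(x<0.001,0.001,x)}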
@

Hypothesis testing has been found to follow a \emph{positive test strategy}.  Researchers collect data that they expect to conform to their prior beliefs and then exaggerate its information value \cite{klayman1987confirmation}, while discounting any inconsistent evidence that comes their way \cite{dunbar1995scientists,dunbar2001scientific,gorman2005scientific,lord1979biased}. This result has been found with both simple experimental tasks and in dynamic artificial environments, such as simulated molecular biology \cite{dunbar1993concept}, programmed robots \cite{klahr1988dual}, and multiple-cue probability learning \cite{o1989effects}.  Similar patterns have been found in scientific laboratories.  For example, in an observational study of a biological sciences laboratory, Dunbar \cite{dunbar2001scientific} found that scientists did not immediately reject their hypotheses after they were contradicted by data.  Rather, their first reaction was to invoke experimental error \cite{dunbar1995scientists,dunbar2001scientific,gorman2005scientific}.  In thirty-seven experimental treatments conducted by one biologist, twenty-one had unexpected results, most of which were treated as errors \cite{dunbar2001scientific}.

After inspecting their data for errors, researchers must decide whether to communicate any data that they consider flawed.  If the researchers' error attributions are accurate, then omitting these errors from published reports may avoid distracting readers.  If they are inaccurate, then failing to publish those data will allow false theories to emerge and persist.  Justified or not, data attributed to error are unlikely to be shared.  For example, statistical significance is often (incorrectly) interpreted as the probability of error in data, and is usually a necessary condition for publication \cite{fanelli2012negative,sterling1959publication}.

Data sharing decisions are not only affected by whether the data are perceived to be faulty, but also by professional rewards for publishing positive (usually statistically significant) results.  These rewards can produce a healthy motivation to make a discovery, such as finding a successful anti-cancer drug.  However, rewards may also undermine accurate data inspection by increasing scrutiny of results that indicate the discovery is false, while simultaneously making affirming results a wanted relief from the pressure to produce \cite{kunda1990case}.  Supporting this account, there is evidence that higher rewards for publishing are associated with publication bias \cite{ioannidis2005early,fanelli2010pressures}.

Here we present three experiments on decisions to share possibly faulty data.  We use Wason's 2-4-6 rule-discovery task \cite{wason1960failure}.  It asks participants to try to discover the rule that generated a set of numbers (2, 4, 6), by proposing a new set of three numbers (a proposed triple), then getting feedback as to whether the numbers that they proposed fit the rule.  We use Penner and Klahr's version \cite{penner1996trust}, in which participants are told that some percentage of the time, the feedback will be false, a feature that adds something like the uncertainty that is inevitable with scientific inferences. 

We add several new features to the task.  (a) Before receiving feedback, participants assess the probability that it will affirm their expectations.  (b) After receiving it, they indicate whether they would share each trial, including the feedback, with a second person trying to discover the same rule. In Experiment One, the sharing decision is done at the end of the task; in Experiments Two and Three it is done immediately after they make their error judgments.  (c) We also use two types of incentives intended to simulate the rewards that may lead to motivated reasoning.  Experiment Two provides participants with a large incentive (\$100) for correctly guessing the rule, and a small incentive (\$1) for concluding that they do not know the answer.  Experiment Three provides participants with an incentive to convince a matched participant that they discovered the rule, whether or not they actually did.  
%(d)  In Experiment Four, a matched participant receives either the full data set or only the data that the original participant chose to share.  That new participant then tries both to guess the true hypothesis and to assess the validity of the original participant's guess.

Using this task, we first replicate the finding that error is more likely to be invoked with disconfirming than with affirming feedback \cite{penner1996trust,gorman1986possibility,gorman1989error}.  We then examine whether these error attributions are justified using two evaluative criteria: (a)  \emph{accuracy}, defined as whether the judgments are correct; and (b) \emph{Bayesian consistency}, defined as attributing feedback to error if and only if either: (i) participants strongly expected the triple they proposed to fit the rule but it did not, or (ii) participants strongly expected the triple they proposed to not fit the rule but it did.\footnote{More precisely, using Bayes' Rule it can be shown that feedback should be attributed to error whenever one believes that there is greater than an 80\% prior probability that the triple fit the rule, but the feedback indicates it does not fit, or conversely one believes there is less than a 20\% prior probability that the triple fit the rule, but the feedback indicates that it does fit.}  \emph{Selective reporting} is the degree to which trials attributed to error are not shared with the matched participant, compared to those attributed to other sources.  
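
A sketch of the derivation behind the Bayesian consistency criterion follows, under two assumptions that are made explicit here for illustration: feedback is false with a known probability $\epsilon$, independently of whether the triple fits, and feedback should be attributed to error when the posterior probability of error exceeds one half.  Let $p$ be the prior probability that the proposed triple fits the rule.  Feedback that the triple does not fit (DNF) is false only if the triple actually fits, so
\begin{align*}
P(\text{error} \mid \text{DNF}) &= \frac{p\,\epsilon}{p\,\epsilon + (1-p)(1-\epsilon)}, \\
P(\text{error} \mid \text{DNF}) > \tfrac{1}{2} &\iff p\,\epsilon > (1-p)(1-\epsilon) \iff p > 1 - \epsilon .
\end{align*}
Feedback that the triple fits (FIT) is false only if the triple does not fit, so
\begin{align*}
P(\text{error} \mid \text{FIT}) &= \frac{(1-p)\,\epsilon}{(1-p)\,\epsilon + p\,(1-\epsilon)}, \\
P(\text{error} \mid \text{FIT}) > \tfrac{1}{2} &\iff (1-p)\,\epsilon > p\,(1-\epsilon) \iff p < \epsilon .
\end{align*}
Setting $\epsilon = 0.2$ reproduces the 80\% and 20\% thresholds given in the footnote.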

%The effect of selective reporting is measured as the matched participant's ability to judge whether the original participant's conclusion was correct and guess the true hypothesis, with either all the data that the original participant collected or just with the data selected for transmission.

\section{Experiment One}
Experiment One looks at whether error attributions are consistent with prior beliefs, whether error attributions correspond to actual error, and whether trials are less likely to be shared with another person when the feedback is attributed to error.  We used the Wason 2-4-6 rule discovery task with feedback error \cite{penner1996trust}.  On each trial there was a 20\% chance that the feedback was in error.

\subsection{Method}
\subsubsection{Participants}
Eighteen Carnegie Mellon University undergraduates completed the task for course credit.\footnote{A random-effects meta-analysis of the effect of disconfirming feedback on error attributions from Penner and Klahr's Study One (numerical broad and narrow conditions), Penner and Klahr's Study Two (numerical/narrow) \cite{penner1996trust} and Gorman \cite{gorman1986possibility}, indicated an overall Hedges' $g=0.43$, 95\% CI [0.33, 0.53].  The sample sizes of the control groups in these studies were 15, 15, 25, and 24, respectively, indicating eighteen participants should provide sufficient power to detect the effect of feedback on error attribution.}  They were on average 21 years old (range: 18--38).  There were 7 women.  One participant gave no valid responses. 

\subsubsection{Procedure}
Participants were seated at a computer, asked to sign an informed consent document, and then instructed that they had 30 minutes to complete the task.  Participants completed the task online as a Qualtrics questionnaire with embedded JavaScript used for feedback.  Each page (trial) of the questionnaire had the same format.  In order, participants proposed a rule, proposed a new triple, assessed the probability that the triple they proposed fit the Actual Rule, received feedback, judged whether the feedback reflected error, and then decided if they wanted to give their Final Answer.  They were reminded to record all responses both on the computer and on the spreadsheet they were given.  After participants decided to stop new trials and give their Final Answer, or thirty minutes had passed, they were asked to review their spreadsheet and mark the trials that they thought should be shared, in order to help a new participant solve the problem.

\subsubsection{Materials}
The materials were a modification of Penner and Klahr's \cite{penner1996trust} version of the Wason 2-4-6 rule discovery task, with the study introduction rewritten to increase readability and comprehension, based on pretesting using cognitive interviews.  The study also used computerized, rather than hand-written, feedback.

\subsubsection{Introduction}

Participants were shown the following introduction on the computer along with a separate paper copy as a reminder:

\begin{quote}
``You will be given three numbers that are related somehow. For example: 3, 5, and 15. This is called a triple. There are many possible rules that could relate these three numbers. We have selected only one of them. The rule that we selected is called the Actual Rule. You will not be given the Actual Rule. Your task is to discover it. The initial triple on the next page is an example drawn from the Actual Rule.''

``Our study is using several versions of this task.  Yours is a particularly difficult one.  Sometimes, even if your Proposed Triple FITs the Actual Rule, the computer may output that it DOES NOT FIT. Conversely, sometimes, when your Proposed Triple DOES NOT FIT the rule, the computer may output that it FITs. On any trial there is a 20\% chance that you will get false feedback. For each trial if you think false feedback occurred mark ``F'' in the ``Feedback'' column on your spreadsheet. If you think true feedback occurred, mark ``T'' in the ``Feedback'' column.''

``At any time you may try to guess the Actual Rule that we selected. This is called the Final Answer. You only get one Final Answer and it may be wrong. Once you make your Final Answer you can no longer get feedback from the computer and the experiment will end.''
\end{quote}

\subsubsection{Initial Triple}
At the top of each page, the initial triple (2,4,6) was shown.  Participants were told:
\begin{quote}
``The initial triple above is an example drawn from the Actual Rule.''
\end{quote}

\subsubsection{Proposed Triple}
After writing their best explanation of the initial triple, participants were instructed to propose a new triple:
\begin{quote}
``You may propose additional triples to help you discover the Actual Rule. The computer will tell you whether the triple you proposed fits the Actual Rule. Record all information on the spreadsheet you were given.  Write one number of your triple in each box below.''
\end{quote}

\subsubsection{Prior Probability}

On each trial, before they received feedback, participants assigned a probability that the triple they proposed fit the actual rule, by answering the following question:

\begin{quote}
``What is the probability that the triple you proposed fits the Actual Rule? (must be a number between 0 and 100)'' 
\end{quote}

\begin{flushleft}
We denote this $P(TFTR)$ for `[P]robability that the [T]riple [F]its [T]he actual [R]ule'.
\end{flushleft}

\subsubsection{Error judgment}

Immediately after receiving feedback that the triple fit (FIT) or did not fit (DNF) the rule, participants judged whether they thought the feedback was due to error:

\begin{quote}
``Do you think this feedback was true or false? (True/False)''
\end{quote}

\subsubsection{Final Answer}
Once participants felt they had completed enough trials, or once the 30-minute window expired, they were asked to make their Final Answer:
\begin{quote}
``Write your Final Answer for the Actual Rule in the box below (it can be mathematical or in words).''
\end{quote}

\subsubsection{Data sharing}

Participants then decided which trials they wanted to share with a new participant:

\begin{quote}
``In this experiment, a trial is a page where you proposed a rule, a triple, a probability estimate, received feedback, and judged whether you thought the feedback was false or true.'' 

``In a future experiment we will have a new participant try to discover the same rule you tried to discover. 

You can choose trials that you think will help him or her solve the rule. 
For each trial you indicate, all of the information would be shared, including: 
\begin{enumerate}
\item your rule
\item the proposed triple
\item your probability estimate
\item the feedback
\item whether you thought the feedback was false or true
\end{enumerate}

In the space below, please indicate the trials you conducted that you think would help this person.'' 
\end{quote}

\subsection{Results}

<<exp1,results=hide,echo=false,fig=false>>=
##Hierarchical linear model for correlation between prior (P(TFTR)) and error attribution##
wasqa<-subset(was1,falsification==0)
wasqa$error.should<-ifelse(wasqa$p.fit<0.2,1,0)

phiaff<-glmer(error.should~error+(1|subjectid),data=wasqa,family=binomial(link="logit"))
phiaff.null<-glmer(error.should~(1|subjectid),data=wasqa,family=binomial(link="logit"))
affnova<-anova(phiaff.null,phiaff)
affchip<-affnova$Chisq[2]
affphip<-sqrt(affnova$Chisq[2]/length(wasqa$error.should))
affphi.p<-affnova[2,7]

wasqf<-subset(was1,falsification==1)
wasqf$error.should<-ifelse(wasqf$p.fit>0.8,1,0)

phidisc<-glmer(error.should~error+(1|subjectid),data=wasqf,family=binomial(link="logit"))
phidisc.null<-glmer(error.should~(1|subjectid),data=wasqf,family=binomial(link="logit"))
discnova<-anova(phidisc.null,phidisc)
discchip<-discnova$Chisq[2]
discphip<-sqrt(discnova$Chisq[2]/length(wasqf$error.should))
discphi.p<-discnova[2,7]

wasq<-rbind(wasqa,wasqf)
glmer.ov<-glmer(error.should~error+(1|subjectid),data=wasq,family=binomial(link="logit"))
glmer.ov.null<-glmer(error.should~(1|subjectid),data=wasq,family=binomial(link="logit"))
nova.ov<-anova(glmer.ov.null,glmer.ov)
chi.ov<-nova.ov$Chisq[2]
phi.ov<-sqrt(chi.ov/length(wasq$actual.error))
should.error<-glmer(error~error.should*falsification+(1|subjectid),data=wasq,family=binomial(link="logit"))

####
##Nested Bootstrap##
chip<-c()
phip<-c()
phi.p<-c()
glmeraa<-c()
glmerff<-c()
glmershnerr<-c()
glmersherr<-c()
glmersha<-c()
glmershf<-c()
qf<-was1
qf$new.id<-c(rep(0,length(qf$subjectid)))
for(i in 1:n1){
a<-seq(18)
a<-a[!a==13]
a<-sort(sample(a,size=17,replace=TRUE))
new.id<-seq(length(a))
a<-cbind(a,new.id)
a<-data.frame(a)
q1<-data.frame()
for(j in 1:length(a[,2])){
  qf$new.id[qf$subjectid==a$a[j]]<-j
q1<-rbind(q1,qf[qf$subjectid==a$a[j],])
} 
glmerfals<-glmer(error~falsification+(1|new.id),data=q1,family=binomial(link="logit"))
glmeraa[i]<-invlogit(fixef(glmerfals)[1])
glmerff[i]<-invlogit(fixef(glmerfals)[1]+fixef(glmerfals)[2])
glmerphi<-glmer(actual.error~error+(1|new.id),data=q1,family=binomial(link="logit"))
glmerphi.null<-glmer(actual.error~(1|new.id),data=q1,family=binomial(link="logit"))
nova<-anova(glmerphi.null,glmerphi)
chip[i]<-nova$Chisq[2]
phip[i]<-sqrt(nova$Chisq[2]/length(q1$actual.error))
phi.p[i]<-nova[2,7]
glmershfee<-glmer(shared~falsification+(1|new.id),data=q1,family=binomial(link="logit"))
glmersha[i]<-invlogit(fixef(glmershfee)[1])
glmershf[i]<-invlogit(fixef(glmershfee)[1]+fixef(glmershfee)[2])
glmershatt<-glmer(shared~error+(1|new.id),data=q1,family=binomial(link="logit"))
glmershnerr[i]<-invlogit(fixef(glmershatt)[1])
glmersherr[i]<-invlogit(fixef(glmershatt)[1]+fixef(glmershatt)[2])
}
glmeraa.se<-sd(glmeraa)
glmerff.se<-sd(glmerff)
glmershf.se<-sd(glmershf)
glmersha.se<-sd(glmersha)
glmersherr.se<-sd(glmersherr)
glmershnerr.se<-sd(glmershnerr)
glmerfals<-glmer(error~falsification+(1|subjectid),data=was1,family=binomial(link="logit"))
glmershfee<-glmer(shared~falsification+(1|subjectid),data=was1,family=binomial(link="logit"))
glmershatt<-glmer(shared~error+(1|subjectid),data=was1,family=binomial(link="logit"))
glmerfalsact<-glmer(error~falsification*actual.error+(1|subjectid),data=was1,family=binomial(link="logit"))
###
##total bayes trials: attributions to error where the criterion prescribes them, plus non-attributions where it does not##
bayes.trials<-sum(was1$error[was1$falsification==1 & was1$p.fit>0.8])+
  sum(was1$error[was1$falsification==0 & was1$p.fit<0.2])+
  length(was1$error[was1$falsification==1 & was1$p.fit<0.8])-sum(was1$error[was1$falsification==1 & was1$p.fit<0.8])+
  length(was1$error[was1$falsification==0 & was1$p.fit>0.2])-sum(was1$error[was1$falsification==0 & was1$p.fit>0.2])
total.trials<-length(was1$error[was1$falsification==1 & was1$p.fit>0.8])+length(was1$error[was1$falsification==1 & was1$p.fit<0.8])+length(was1$error[was1$falsification==0 & was1$p.fit<0.2])+length(was1$error[was1$falsification==0 & was1$p.fit>0.2])
###
@ 

Unless otherwise noted, all estimation was done using hierarchical logistic models with subject-level varying intercepts \cite{gelman2007data,gelman2010arm}.  The models assume that multiple observations from the same person are conditionally independent given the subject-specific intercept.  Tests, standard errors, and $p$-values based on these models were calculated using a non-parametric bootstrap with \Sexpr{n1} simulations per statistic \cite{efron1993introduction}.
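
As a brief illustration of how two of the reported quantities are obtained (the symbols $N$, $B$, $\hat{\theta}^{*}_{b}$, and $\bar{\theta}^{*}$ are notation introduced here for exposition), the effect size $\phi$ is computed from a likelihood-ratio statistic $\chi^{2}$ on $N$ trials, and each bootstrap standard error is the standard deviation of the $B$ bootstrap replicates $\hat{\theta}^{*}_{1},\dots,\hat{\theta}^{*}_{B}$ of a statistic, with $\bar{\theta}^{*}$ their mean:
\[
\phi=\sqrt{\frac{\chi^{2}}{N}},
\qquad
SE_{\mathrm{boot}}=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^{*}_{b}-\bar{\theta}^{*}\right)^{2}} .
\]
For a purely hypothetical example, $\chi^{2}=4$ on $N=100$ trials would give $\phi=\sqrt{0.04}=0.2$.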

\subsubsection{Performance}
Participants completed a median of eight trials.  Each participant's task performance score was determined by their final answer, scored on a 5-point scale awarding one point for each element of the rule that they had discovered.  The five elements were: 1) even numbers, 2) consecutive numbers, 3) ascending numbers, 4) the lower bound is 2, and 5) the upper bound is 100.  All six participants who scored zero used a mathematical formula that was either unspecific (e.g., $x+2$) or not a rule (e.g., $(2+6)/2=4$).  Among the seven participants with a score of 1, six included even numbers in their answer, and one mentioned ascending numbers.  Of the four participants who scored 2 on the task, two mentioned consecutive even numbers, and two mentioned ascending evens.  The one participant who scored 3 on the task guessed sequential even numbers less than 100.

\subsubsection{Error Attributions}

Replicating Penner and Klahr \cite{penner1996trust}, participants judged disconfirming feedback to be error more often (\Sexpr{prettyNum(100*mean(glmerff))}\%, $SE=$ \Sexpr{prettyNum(100*glmerff.se)}\%) than affirming feedback (\Sexpr{prettyNum(100*mean(glmeraa))}\%, $SE=$ \Sexpr{prettyNum(100*glmeraa.se)}\%), $t(162) = 3.77$, $p < 0.05$, $d = 0.30$.  In a multiple regression, there was only a main effect of feedback type on attributions of error ($t(159)=$ \Sexpr{prettyNum(fixef(glmerfalsact)[2]/sqrt(diag(vcov(glmerfalsact))[2]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(glmerfalsact)[2]/sqrt(diag(vcov(glmerfalsact))[2])),159)))}), with no significant main effect of actual error ($t(159)=$ \Sexpr{prettyNum(fixef(glmerfalsact)[3]/sqrt(diag(vcov(glmerfalsact))[3]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(glmerfalsact)[3]/sqrt(diag(vcov(glmerfalsact))[3])),159)))}) or interaction between the two factors ($t(159)=$ \Sexpr{prettyNum(fixef(glmerfalsact)[4]/sqrt(diag(vcov(glmerfalsact))[4]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(glmerfalsact)[4]/sqrt(diag(vcov(glmerfalsact))[4])),159)))}).
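
The percentages reported above are on the probability scale.  As a sketch of the conversion (writing $\beta_{0}$ for the fixed-effect intercept and $\beta_{1}$ for the coefficient on disconfirming feedback, notation introduced here for illustration), the fitted log-odds are mapped to proportions with the inverse-logit function:
\[
\hat{p}_{\mathrm{affirming}}=\frac{1}{1+e^{-\beta_{0}}},
\qquad
\hat{p}_{\mathrm{disconfirming}}=\frac{1}{1+e^{-(\beta_{0}+\beta_{1})}} .
\]
For example, the hypothetical coefficients $\beta_{0}=-1$ and $\beta_{1}=0.5$ would give $1/(1+e^{1})\approx 0.27$ for affirming feedback and $1/(1+e^{0.5})\approx 0.38$ for disconfirming feedback.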

\subsubsection{Bayesian Consistency}

<<was1cons,echo=false,fig=false,results=hide>>=
#install.packages("quantreg")
#install.packages("ggplot2")
library(ggplot2)
library(quantreg)

wasqa<-subset(was1,falsification==0)
wasqa$error.should<-ifelse(wasqa$p.fit<0.2,1,0)
wasqf<-subset(was1,falsification==1)
wasqf$error.should<-ifelse(wasqf$p.fit>0.8,1,0)

phiaff<-lm(error~error.should,data=wasqa)
phidisc<-lm(error~error.should,data=wasqf)

hd.new<-data.frame(error.should=1)
hc.new<-data.frame(error.should=0)
fd.new<-data.frame(error.should=1)
fc.new<-data.frame(error.should=0)
hd.pred<-predict(phiaff,hd.new,interval="confidence",level=0.95)
hc.pred<-predict(phiaff,hc.new,interval="confidence",level=0.95)
fd.pred<-predict(phidisc,fd.new,interval="confidence",level=0.95)
fc.pred<-predict(phidisc,fc.new,interval="confidence",level=0.95)
png(file="was1cons.png",width=1500,height=1000,res=200)
# Assemble the predicted proportions and their confidence limits (clipped to [0,1]) for plotting
df <- data.frame(Hindsight= factor(c("Error","Not Error","Error","Not Error")),Probability = c(hd.pred[1],hc.pred[1],fd.pred[1],fc.pred[1]),Feedback = factor(c("Affirming","Affirming","Disconfirming","Disconfirming")),lim.low = c(max(0,hd.pred[2]),max(0,hc.pred[2]),max(0,fd.pred[2]),max(0,fc.pred[2])),lim.high=c(min(1,hd.pred[3]),min(1,hc.pred[3]),min(1,fd.pred[3]),min(1,fc.pred[3])))
# Define the top and bottom of the error bars (upper limit is ymax, lower limit is ymin)
limits <- aes(ymax = lim.high, ymin = lim.low)
p <- ggplot(df, aes(colour=Feedback, y=Probability, x=Hindsight))
# theme() replaces the defunct opts(); print() makes the draw to the png device explicit
print(p + geom_crossbar(limits, width=0.1,position="dodge",size=0.3,linetype=1,alpha=1,fatten=5)+scale_y_continuous(limits = c(0,1))+theme_bw()+ylab("Proportion Attributed to Error")+xlab("Bayes' Rule")+scale_colour_manual(values = c("darkred","darkblue"))+theme(legend.position="bottom"))
dev.off()

share.should<-glmer(shared~actual.error*error+(1|subjectid),data=was1,family=binomial(link="logit"))
@

\begin{figure}[h]
    \centering
\scalebox{1.4}{\includegraphics{was1cons}}
\caption[Trials Attributed to Error Compared to Bayes' Rule]{Proportion of trials attributed to error depending on whether Bayes' Rule predicted error attribution and whether the feedback was affirming or disconfirming.}
\end{figure}

These error attributions were consistent with prior beliefs.  When participants received disconfirming feedback, they correctly attributed \Sexpr{prettyNum(sum(was1$error[was1$falsification==1 & was1$p.fit>0.8]))} of \Sexpr{prettyNum(length(was1$error[was1$falsification==1 & was1$p.fit>0.8]))} trials to error when they strongly expected the triple to fit the rule beforehand ($P(TFTR)>0.8$), and incorrectly attributed \Sexpr{prettyNum(sum(was1$error[was1$falsification==1 & was1$p.fit<0.8]))} of \Sexpr{prettyNum(length(was1$error[was1$falsification==1 & was1$p.fit<0.8]))} trials to error when the strength of their prior beliefs did not justify attributing the feedback to error ($P(TFTR)<0.8$), $\chi^{2}(1) = \Sexpr{prettyNum(discchip)}$, $p<0.05$, $\phi = \Sexpr{prettyNum(discphip)}$.  When receiving affirming feedback, they correctly attributed \Sexpr{prettyNum(sum(was1$error[was1$falsification==0 & was1$p.fit<0.2]))} of \Sexpr{prettyNum(length(was1$error[was1$falsification==0 & was1$p.fit<0.2]))} trials to error when they strongly expected the triple not to fit the rule beforehand ($P(TFTR)<0.2$), and incorrectly attributed \Sexpr{prettyNum(sum(was1$error[was1$falsification==0 & was1$p.fit>0.2]))} of \Sexpr{prettyNum(length(was1$error[was1$falsification==0 & was1$p.fit>0.2]))} trials to error when the strength of their prior beliefs did not justify attributing the feedback to error ($P(TFTR)>0.2$), $\chi^{2}(1)=$ \Sexpr{prettyNum(affchip)}, $p=$ \Sexpr{prettyNum(ptrunc(affphi.p))}, $\phi = \Sexpr{prettyNum(affphip)}$.
However, in a multiple regression, there was a main effect of feedback type on attributions of error ($t(159)=$ \Sexpr{prettyNum(fixef(should.error)[3]/sqrt(diag(vcov(should.error))[3]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(should.error)[3]/sqrt(diag(vcov(should.error))[3])),159)))}), no significant main effect of Bayes' Rule requiring error attribution ($t(159)=$ \Sexpr{prettyNum(fixef(should.error)[2]/sqrt(diag(vcov(should.error))[2]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(should.error)[2]/sqrt(diag(vcov(should.error))[2])),159)))}), and no interaction between the two factors ($t(159)=$ \Sexpr{prettyNum(fixef(should.error)[4]/sqrt(diag(vcov(should.error))[4]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(should.error)[4]/sqrt(diag(vcov(should.error))[4])),159)))}).  The overall correlation between participants' judgments and the consistency criterion was $\phi=$ \Sexpr{prettyNum(phi.ov)}, $\chi^{2}(1)=$ \Sexpr{prettyNum(chi.ov)}, $p<0.05$.\footnote{We use hierarchical linear probability models with asymptotic standard errors to estimate the correlation between prior belief ($P(TFTR)$) and error attribution.  To evaluate coherence, we also estimated the following equations.  Equations (1) and (2) give the coherent probability that feedback is in error for falsifying and affirming feedback, respectively.  In both equations, $\alpha=0.2$ for a Bayesian reasoner, but $\alpha_{f}$ (falsifying) and $\alpha_{a}$ (affirming) need not equal 0.2 for a non-Bayesian.  For notational simplicity, let $x=P(TFTR)$:
\begin{equation}
\frac{\alpha_{f} x}{\alpha_{f} x+0.8(1-x)}
\end{equation}
\begin{equation}
\frac{\alpha_{a}(1-x)}{\alpha_{a}(1-x)+0.8x}
\end{equation}

To evaluate whether participants behaved in a Bayesian manner, we use a quasi-Bayesian model.  In the quasi-Bayesian model, participants are not constrained to believe that the error rate is exactly 20\% for both affirming and falsifying evidence.  Instead, this model estimates the participants' believed error rate separately for affirming and falsifying evidence.  Thus, the Bayesian model is a special case of the quasi-Bayesian model in which the participant believes affirming and falsifying evidence have equal error rates exactly equal to 20\%.  The quasi-Bayesian model is fit using non-linear weighted least squares \cite{team2010r}.  We assume that the two kinds of classification mistake are equally costly, so each model predicts error if the posterior probability of error is greater than 50\%, and no error otherwise.
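
To make this decision rule concrete, the following short derivation (a sketch that follows directly from Equations (1) and (2) and the 50\% cutoff) gives the prior at which each model switches to predicting an error attribution.  For falsifying feedback,
\[
\frac{\alpha_{f} x}{\alpha_{f} x+0.8(1-x)}>\frac{1}{2}
\iff \alpha_{f} x>0.8(1-x)
\iff x>\frac{0.8}{0.8+\alpha_{f}},
\]
and for affirming feedback,
\[
\frac{\alpha_{a}(1-x)}{\alpha_{a}(1-x)+0.8x}>\frac{1}{2}
\iff \alpha_{a}(1-x)>0.8x
\iff x<\frac{\alpha_{a}}{0.8+\alpha_{a}}.
\]
Setting $\alpha_{f}=\alpha_{a}=0.2$ recovers the Bayesian cutoffs $x>0.8$ and $x<0.2$.  For instance, a Bayesian reasoner with $x=0.9$ who receives falsifying feedback has a posterior probability of error of $0.18/(0.18+0.08)\approx 0.69$, which is above the 50\% cutoff.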

Although participants attributed feedback to error more often when they observed falsifying rather than affirming feedback, this may be coherent once their prior beliefs are taken into account.  If participants attribute falsifying feedback to error when $P(TFTR)>0.8$, and affirming feedback to error when $P(TFTR)<0.2$, they are consistent with the normative Bayesian prescription.

As the Bayesian model predicts, participants attributed falsifying feedback to error more often when they had a strong prior belief that their triple fit the actual rule. Similarly, as the Bayesian model predicts, they attributed affirming feedback to error less often when they had a strong prior belief that their triple would fit the actual rule.  Pooling all the data rather than using the hierarchical model, the Bayesian (log likelihood, $LL = -79$) and quasi-Bayesian ($LL = -76$) models fit the data almost as well as maximum-likelihood logistic regression ($LL = -72$) and a non-parametric second-order Gaussian kernel density estimator ($LL = -64$) \cite{hayfield2008nonparametric}, both of which had more free parameters (4 or unlimited, respectively).  The parameters of the quasi-Bayesian model were $\alpha_{f}=0.24$ (95\% CI [0.12, 0.36]) and $\alpha_{a}= 0.06$ (95\% CI [0.01, 0.11]) for falsification and affirmation, respectively.}

\subsubsection{Accuracy}
Although error attributions were consistent with prior beliefs, they did not match actual error.  When participants believed that feedback was false, it was as likely to be accurate as inaccurate (23\% vs. 29\%), $\chi^{2}(1)=$ \Sexpr{prettyNum(mean(chip))}, $p=$ \Sexpr{prettyNum(ptrunc(mean(phi.p)))}, $\phi=$ \Sexpr{prettyNum(mean(phip))}.

\subsubsection{Data Sharing}
Participants were as likely to share data when feedback affirmed their hypothesis as when it did not (\Sexpr{prettyNum(100*mean(glmersha))}\%, $SE=$ \Sexpr{prettyNum(100*glmersha.se)}\% vs. \Sexpr{prettyNum(100*mean(glmershf))}\%, $SE=$ \Sexpr{prettyNum(100*glmershf.se)}\%), $t(122)=$ 0.48, $p>$ 0.05.  They were also equally likely to share feedback whether they judged it accurate or inaccurate (\Sexpr{prettyNum(100*mean(glmershnerr))}\%, $SE=$ \Sexpr{prettyNum(100*glmershnerr.se)}\% vs. \Sexpr{prettyNum(100*mean(glmersherr))}\%, $SE=$ \Sexpr{prettyNum(100*glmersherr.se)}\%), $t(122)=$ 0.53, $p>$ 0.05.  In a model that included both main effects and the interaction between actual error and attribution of error to predict whether each trial would be shared, there was no significant main effect of error attribution ($t(119)=$ \Sexpr{prettyNum(abs(fixef(share.should)[3]/sqrt(diag(vcov(share.should)))[3]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(share.should)[3]/sqrt(diag(vcov(share.should)))[3]),119)))}), no significant main effect of actual error ($t(119)=$ \Sexpr{prettyNum(abs(fixef(share.should)[2]/sqrt(diag(vcov(share.should)))[2]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(share.should)[2]/sqrt(diag(vcov(share.should)))[2]),119)))}), and no significant interaction between the two factors ($t(119)=$ \Sexpr{prettyNum(abs(fixef(share.should)[4]/sqrt(diag(vcov(share.should)))[4]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(share.should)[4]/sqrt(diag(vcov(share.should)))[4]),119)))}).

\subsection{Discussion}

The results replicate the findings of Gorman \cite{gorman1989error} and Penner and Klahr \cite{penner1996trust}, who found that people are more likely to question feedback when it disconfirms their hypothesis.  For error attributions, most of these judgments were normatively justified, matching the consistency criterion on \Sexpr{bayes.trials} of \Sexpr{total.trials} trials (\Sexpr{prettyNum(100*bayes.trials/total.trials)}\%).  In spite of this consistency, participants were unable to identify actual error.  Finally, on a task new to this study, participants shared information at equal rates regardless of whether the feedback was affirming or disconfirming and regardless of whether it was attributed to error.

The positive test strategy \cite{klayman1987confirmation} entails seeking affirming evidence and discounting disconfirming evidence.  Experiment One found that this strategy is both internally consistent and inaccurate.  Participants were, however, no less likely to share disconfirming or seemingly flawed data.  Although this pattern of data sharing contradicts the positive test strategy, we observed that some participants had difficulty interpreting the open-ended data-sharing question.  Namely, when asked which trials they wanted to share, some responded with a triple (e.g., ``2, 4, 6''), rather than a trial (e.g., ``trial 3'').  Experiment Two addresses this problem by using a fixed response format after each trial rather than an open-ended one at the end.

\section{Experiment Two}

Experiment One replicated the positive test strategy found in previous studies, with participants invoking error more often for disconfirming feedback.  These error attributions were justified in terms of the consistency criterion, but not the accuracy criterion.  The second experiment examines the effects of incentives on these judgments, using a monetary payoff that encourages participants to convince themselves that they know the rule.  Specifically, participants were offered \$100 for guessing the Actual Rule correctly, and \$1 for concluding that they did not know it.  This incentive scheme was intended to encourage motivated reasoning, so that hopeful participants would believe that they had reached a correct conclusion rather than assess their knowledge candidly.
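
As a rough illustration of why this scheme favors guessing (assuming a risk-neutral participant and writing $q$ for the participant's subjective probability that the guess is exactly correct; both are assumptions introduced here for exposition, not part of the study materials), guessing is preferred to the sure payment whenever
\[
100\,q + 0\,(1-q) > 1 \iff q > 0.01 .
\]
A participant who is only 5\% confident, for example, would still expect $100 \times 0.05 = 5$ dollars from guessing, five times the sure payment.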

Experiment Two also improves the data-sharing decision.  Immediately after receiving feedback, participants make a binary (Yes--No) decision about whether each trial should be shared.  Eliciting data-sharing decisions at the end of each trial rather than at the end of the experiment was intended to make it clearer that the sharing decision applies to the current trial, resolving any ambiguity about whether triples or trials should be shared.  It also elicits sharing judgments earlier in the task, before participants might become tired or frustrated.

\subsection{Method}

\subsubsection{Participants} 
Fifty-eight Carnegie Mellon University undergraduates participated in the experiment for course credit. Thirty-four were women, and the average age was 20 years (range: 18--24).

\subsubsection{Design}
Participants were randomly assigned to either the control or the incentive condition.  This was a one-way between-subjects design with two levels. 

\subsubsection{Procedure}
The entire experiment lasted 30 minutes.  Participants gave informed consent and received the instructions and the response sheet. At the end of the experiment, they were asked to leave their email address, with the promise that they would be contacted later to receive their bonus payment if they had solved the rule.  Payment of the bonus was delayed to prevent participants from telling their friends the correct answer.

\subsubsection{Materials} 
All materials were the same as those in the first study except for the following three changes.  First, participants were given the spreadsheet, but were not required to use it.   

Second, in the financial incentive condition, participants were told: 

\begin{quote}
``At the end of the experiment you will be given a chance to win money by guessing the rule. If you decide to guess the rule you will receive 100 dollars if the guess is exactly correct, but 0 dollars if the guess is incorrect. On the other hand, you can decide that you do not know and receive 1 dollar for sure.''
\end{quote}

Third, decisions to share a trial were made immediately after participants made their error attributions:  

\begin{quote}
``We are also interested in how people share information. In a future experiment, a new participant will try to discover the same Actual Rule that you are trying to discover. You can share information with this new participant to help him or her solve the Actual Rule.  If the new participant solves the rule, you will receive an additional 50 dollars.'' 
\end{quote}

\begin{flushleft}
The trials were described in the same way as Experiment One, but the sharing judgment was now binary:
\end{flushleft}
\begin{quote}
``Do you think this trial should be shared with a new participant? (Yes/No)''
\end{quote}

\subsection{Results}

<<exp2,results=hide,echo=false,fig=false>>=
##Control: Nested Bootstrap##
chip<-c()
phip<-c()
phi.p<-c()
glmeraa<-c()
glmerff<-c()
glmershnerr<-c()
glmersherr<-c()
glmersha<-c()
glmershf<-c()
qf<-was2
##For each observation in the dataset make a zero##
qf$new.id<-c(rep(0,length(qf$subjectid)))

for(i in 1:n2){
##obtain the id numbers of subjects we want##
a<-as.numeric(levels(as.factor(qf$subjectid[qf$incentive==0])))
##Sort a sample from these subjects##
a<-sort(sample(a,size=length(levels(as.factor(qf$subjectid[qf$incentive==0]))),replace=TRUE))
##Make some new IDs for the bootstrap sample##
new.id<-seq(length(a))
##Make a data frame holding the old and new ID numbers##
a<-cbind(a,new.id)
a<-data.frame(a)
q1<-data.frame()
####
##For each ID number##
for(j in 1:length(a[,2])){
##The participant with subjectID equal to the jth ID number gets a new id number equal to j##
  qf$new.id[qf$subjectid==a$a[j]]<-j
  ##Add this new participant to the bootstrapped sample##
q1<-rbind(q1,qf[qf$subjectid==a$a[j],])
} 

glmerfals<-glmer(error~falsification+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmeraa[i]<-invlogit(fixef(glmerfals)[1])
glmerff[i]<-invlogit(fixef(glmerfals)[1]+fixef(glmerfals)[2])

glmerphi<-glmer(actual.error~error+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmerphi.null<-glmer(actual.error~(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
nova<-anova(glmerphi.null,glmerphi)
chip[i]<-nova$Chisq[2]
phip[i]<-sqrt(nova$Chisq[2]/length(q1[q1$incentive==0,]$actual.error))
phi.p[i]<-nova[2,7]

glmershfee<-glmer(share~falsification+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmersha[i]<-invlogit(fixef(glmershfee)[1])
glmershf[i]<-invlogit(fixef(glmershfee)[1]+fixef(glmershfee)[2])

glmershatt<-glmer(share~error+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmershnerr[i]<-invlogit(fixef(glmershatt)[1])
glmersherr[i]<-invlogit(fixef(glmershatt)[1]+fixef(glmershatt)[2])
}
glmeraa.se<-sd(glmeraa)
glmerff.se<-sd(glmerff)
glmershf.se<-sd(glmershf)
glmersha.se<-sd(glmersha)
glmershnerr.se<-sd(glmershnerr)
glmersherr.se<-sd(glmersherr)
###
##Incentive: Nested Bootstrap##
chip.i<-c()
phip.i<-c()
phi.p.i<-c()
glmeraa.i<-c()
glmerff.i<-c()
glmershnerr.i<-c()
glmersherr.i<-c()
glmersha.i<-c()
glmershf.i<-c()
qf<-was2
qf$new.id<-c(rep(0,length(qf$subjectid)))
for(i in 1:n2){
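a<-seq(33,58)
##Resample as many subjects as there are in the incentive condition (26 IDs, 33 to 58)##
a<-sort(sample(a,size=length(a),replace=TRUE))
new.id<-seq(length(a))
a<-cbind(a,new.id)
a<-data.frame(a)
q1<-data.frame()
for(j in 1:length(a[,2])){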
  qf$new.id[qf$subjectid==a$a[j]]<-j
q1<-rbind(q1,qf[qf$subjectid==a$a[j],])
} 
glmerfals.i<-glmer(error~falsification+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmeraa.i[i]<-invlogit(fixef(glmerfals.i)[1])
glmerff.i[i]<-invlogit(fixef(glmerfals.i)[1]+fixef(glmerfals.i)[2])

glmerphi<-glmer(actual.error~error+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmerphi.null<-glmer(actual.error~(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
nova<-anova(glmerphi.null,glmerphi)
chip.i[i]<-nova$Chisq[2]
phip.i[i]<-sqrt(nova$Chisq[2]/length(q1[q1$incentive==1,]$actual.error))
phi.p.i[i]<-nova[2,7]

glmershfee.i<-glmer(share~falsification+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmersha.i[i]<-invlogit(fixef(glmershfee.i)[1])
glmershf.i[i]<-invlogit(fixef(glmershfee.i)[1]+fixef(glmershfee.i)[2])

glmershatt.i<-glmer(share~error+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmershnerr.i[i]<-invlogit(fixef(glmershatt.i)[1])
glmersherr.i[i]<-invlogit(fixef(glmershatt.i)[1]+fixef(glmershatt.i)[2])
}
glmeraa.se.i<-sd(glmeraa.i)
glmerff.se.i<-sd(glmerff.i)
glmershf.se.i<-sd(glmershf.i)
glmersha.se.i<-sd(glmersha.i)
glmershnerr.se.i<-sd(glmershnerr.i)
glmersherr.se.i<-sd(glmersherr.i)
###
##Control: Hierarchical linear model for correlation between prior (P(TFTR)) and error attribution##
wasqa<-was2[c(was2$incentive==0 & was2$falsification==0),]
wasqa$error.should<-ifelse(wasqa$p.fit<0.2,1,0)

glma<-glm(error.should~error,data=wasqa,family=binomial(link="logit"))
chi2a<-glma$null.deviance-glma$deviance
phia<-sqrt(chi2a/length(wasqa$error.should))

wasqf<-was2[c(was2$incentive==0 & was2$falsification==1),]
wasqf$error.should<-ifelse(wasqf$p.fit>0.8,1,0)

glmf<-glm(error.should~error,data=wasqf,family=binomial(link="logit"))
chi2f<-glmf$null.deviance-glmf$deviance
phif<-sqrt(chi2f/length(wasqf$error.should))

wasq<-rbind(wasqa,wasqf)
glmer<-glmer(error.should~error*falsification+(1|subjectid),data=wasq,family=binomial(link="logit"))

wasq<-rbind(wasqa,wasqf)
glmer.ov<-glmer(error.should~error+(1|subjectid),data=wasq,family=binomial(link="logit"))
glmer.ov.null<-glmer(error.should~(1|subjectid),data=wasq,family=binomial(link="logit"))
nova.ov<-anova(glmer.ov.null,glmer.ov)
chi.ov<-nova.ov$Chisq[2]
phi.ov<-sqrt(chi.ov/length(wasq$actual.error))

####
##Incentive: Hierarchical linear model for correlation between prior (P(TFTR)) and error attribution##
wasqa<-was2[c(was2$incentive==1 & was2$falsification==0),]
wasqa$error.should<-ifelse(wasqa$p.fit<0.2,1,0)
lmera.i<-lmer(error.should~error+(1|subjectid),data=wasqa)

phiaff.i<-glmer(error.should~error+(1|subjectid),data=wasqa,family=binomial(link="logit"))
phiaff.null.i<-glmer(error.should~(1|subjectid),data=wasqa,family=binomial(link="logit"))
affnova.i<-anova(phiaff.null.i,phiaff.i)
affchip.i<-affnova.i$Chisq[2]
affphip.i<-sqrt(affnova.i$Chisq[2]/length(wasqa$error.should))
affphi.p.i<-affnova.i[2,7]


wasqf<-was2[c(was2$incentive==1 & was2$falsification==1),]
wasqf$error.should<-ifelse(wasqf$p.fit>0.8,1,0)
lmerf.i<-lmer(error.should~error+(1|subjectid),data=wasqf)

phidisc.i<-glmer(error.should~error+(1|subjectid),data=wasqf,family=binomial(link="logit"))
phidisc.null.i<-glmer(error.should~(1|subjectid),data=wasqf,family=binomial(link="logit"))
discnova.i<-anova(phidisc.null.i,phidisc.i)
discchip.i<-discnova.i$Chisq[2]
discphip.i<-sqrt(discnova.i$Chisq[2]/length(wasqf$error.should))
discphi.p.i<-discnova.i[2,7]

wasq<-rbind(wasqa,wasqf)
glmer.ov.i<-glmer(error.should~error+(1|subjectid),data=wasq,family=binomial(link="logit"))
glmer.ov.null.i<-glmer(error.should~(1|subjectid),data=wasq,family=binomial(link="logit"))
nova.ov.i<-anova(glmer.ov.null.i,glmer.ov.i)
chi.ov.i<-nova.ov.i$Chisq[2]
phi.ov.i<-sqrt(chi.ov.i/length(wasq$actual.error))
####
##Total Bayes Trials##
bayes.control<-sum(was2$error[was2$falsification==1 & was2$p.fit>0.8 & was2$incentive==0])+sum(was2$error[was2$falsification==0 & was2$p.fit<0.2 & was2$incentive==0])+length(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==0])-sum(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==0])+length(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==0])-sum(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==0])

total.control<-length(was2$error[was2$falsification==1 & was2$p.fit>0.8 & was2$incentive==0])+length(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==0])+length(was2$error[was2$falsification==0 & was2$p.fit<0.2 & was2$incentive==0])+length(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==0])

bayes.incentive<-sum(was2$error[was2$falsification==1 & was2$p.fit>0.8 & was2$incentive==1])+sum(was2$error[was2$falsification==0 & was2$p.fit<0.2 & was2$incentive==1])+length(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==1])-sum(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==1])+length(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==1])-sum(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==1])

total.incentive<-length(was2$error[was2$falsification==1 & was2$p.fit>0.8 & was2$incentive==1])+length(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==1])+length(was2$error[was2$falsification==0 & was2$p.fit<0.2 & was2$incentive==1])+length(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==1])
####
total.trials<-c()
condition<-c()
for(i in levels(as.factor(was2$subjectid))){
total.trials[i]<-max(was2$trial.num[was2$subjectid==i])
condition[i]<-was2$incentive[was2$subjectid==i][1]
}
was2trials<-data.frame(total.trials,condition)
rqr<-wilcox.test(was2trials$total.trials~was2trials$condition)
glmerfalsact<-glmer(error~falsification*actual.error*incentive+(1|subjectid),data=was2,family=binomial(link="logit"))
@ 

\subsubsection{Incentives and Performance}

<<trials,echo=false,fig=false,results=hide>>=
exp1trials<-c(1,1,1,1,1,3,5,6,8,8,9,13,17,18,19,22,41)
exp2trials.c<-c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,3,4,5,5,5,5,6,7,7,7,7,11,11,11,12,13,15,19)
exp2trials.i<-c(1,1,1,2,3,3,3,4,4,5,6,7,8,10,11,11,13,15,16,16,19,19,24,29,38,42)
ks1<-ks.test(exp1trials,exp2trials.c)
ks2<-ks.test(exp2trials.i,exp2trials.c)
@ 

Incentives doubled the median number of trials from \Sexpr{prettyNum(median(was2trials$total.trials[was2trials$condition==0]))} to \Sexpr{prettyNum(median(was2trials$total.trials[was2trials$condition==1]))}.\footnote{Although the median number of trials increased, a non-parametric Kolmogorov-Smirnov (KS) test for differences in empirical cumulative distributions indicated no difference between the distributions.  Between Experiment One and the control condition of Experiment Two, the KS statistic was $D=$ \Sexpr{prettyNum(ks1$statistic)}, $p=$ \Sexpr{prettyNum(ks1$p.value)}.  Between the control and incentive conditions of Experiment Two, the KS statistic was $D=$ \Sexpr{prettyNum(ks2$statistic)}, $p=$ \Sexpr{prettyNum(ks2$p.value)}.  Thus, although the medians were different, the distributions of trials between the studies and conditions were similar.}  Using the same scoring method as Experiment One, those in the incentive condition scored about the same on average ($M=$ 1.58, $SD=$ 1.21) as those in the control condition ($M=$ 1.66, $SD=$ 1.21), $t(56)=$ 0.80, $p>$ 0.05.  One participant solved the rule exactly and was compensated with a \$99 Amazon gift card.

\subsubsection{Error Judgments}

As in Experiment One, those in the control condition were significantly more likely to see feedback as error when it was disconfirming (\Sexpr{prettyNum(100*mean(glmerff))}\%, $SE=$ \Sexpr{prettyNum(100*glmerff.se)}\%) than when it was affirming (\Sexpr{prettyNum(100*mean(glmeraa))}\%, $SE=$ \Sexpr{prettyNum(100*glmeraa.se)}\%), $t(171)=$ 2.89, $p<$ 0.05, $d=$ 0.22.  In contrast, participants in the incentive condition were equally likely to attribute error to disconfirming feedback (\Sexpr{prettyNum(100*mean(glmerff.i))}\%, $SE=$ \Sexpr{prettyNum(100*glmerff.se.i)}\%) and to affirming feedback (\Sexpr{prettyNum(100*mean(glmeraa.i))}\%, $SE=$ \Sexpr{prettyNum(100*glmeraa.se.i)}\%), $t(309)=$ 0.62, $p>$ 0.05, $d=$ 0.04.  Thus, although we expected the incentives to increase motivated reasoning, they appeared to reduce the tendency for participants to attribute disconfirming results to error.  In multiple regression, there was a significant main effect of feedback type ($t(476)=$ \Sexpr{prettyNum(fixef(glmerfalsact)[2]/sqrt(diag(vcov(glmerfalsact))[2]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(glmerfalsact)[2]/sqrt(diag(vcov(glmerfalsact))[2])),476)))}), a significant main effect of incentive ($t(476)=$ \Sexpr{prettyNum(fixef(glmerfalsact)[4]/sqrt(diag(vcov(glmerfalsact))[4]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(glmerfalsact)[4]/sqrt(diag(vcov(glmerfalsact))[4])),476)))}), and a significant interaction between the two factors, where disconfirming feedback only increased error attributions for those in the control condition ($t(476)=$ \Sexpr{prettyNum(fixef(glmerfalsact)[6]/sqrt(diag(vcov(glmerfalsact))[6]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(glmerfalsact)[6]/sqrt(diag(vcov(glmerfalsact))[6])),476)))}).  There were no other main effects, two-way, or three-way interactions between feedback type, incentive condition, and actual error.
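
As a sketch of how the regression statistics above can be read, assuming a standard Wald test with the 476 degrees of freedom reported in the text (the symbols $\hat{\beta}_j$ and $T_{476}$ are notation introduced here for illustration), each fixed-effect estimate is divided by its standard error and the two-sided $p$-value is the tail probability of the resulting statistic:
\[
t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}, \qquad p_j = 2\,\Pr\left(T_{476} \geq |t_j|\right),
\]
where $T_{476}$ denotes a $t$-distributed random variable with 476 degrees of freedom.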

\subsubsection{Bayesian Consistency}

<<was2cons,echo=false,fig=false,results=hide>>=
#install.packages("quantreg")
#install.packages("ggplot2")
library(ggplot2)
library(quantreg)

wasqa<-was2[c(was2$incentive==0 & was2$falsification==0),]
wasqa$error.should<-ifelse(wasqa$p.fit<0.2,1,0)
wasqf<-was2[c(was2$incentive==0 & was2$falsification==1),]
wasqf$error.should<-ifelse(wasqf$p.fit>0.8,1,0)

wasqa.i<-was2[c(was2$incentive==1 & was2$falsification==0),]
wasqa.i$error.should<-ifelse(wasqa.i$p.fit<0.2,1,0)
wasqf.i<-was2[c(was2$incentive==1 & was2$falsification==1),]
wasqf.i$error.should<-ifelse(wasqf.i$p.fit>0.8,1,0)

phiaff<-lm(error~error.should,data=wasqa)
phidisc<-lm(error~error.should,data=wasqf)
phiaff.i<-lm(error~error.should,data=wasqa.i)
phidisc.i<-lm(error~error.should,data=wasqf.i)

hd.new<-data.frame(error.should=1)
hc.new<-data.frame(error.should=0)
fd.new<-data.frame(error.should=1)
fc.new<-data.frame(error.should=0)
hd.pred<-predict(phiaff,hd.new,interval="confidence",level=0.95)
hc.pred<-predict(phiaff,hc.new,interval="confidence",level=0.95)
fd.pred<-predict(phidisc,fd.new,interval="confidence",level=0.95)
fc.pred<-predict(phidisc,fc.new,interval="confidence",level=0.95)
hd.pred.i<-predict(phiaff.i,hd.new,interval="confidence",level=0.95)
hc.pred.i<-predict(phiaff.i,hc.new,interval="confidence",level=0.95)
fd.pred.i<-predict(phidisc.i,fd.new,interval="confidence",level=0.95)
fc.pred.i<-predict(phidisc.i,fc.new,interval="confidence",level=0.95)

png(file="was2cons.png",width=1500,height=1000,res=200)
# Create a simple example dataset 
df <- data.frame(Incentive= factor(c("Control","Control","Control","Control","Incentive","Incentive","Incentive","Incentive")),Hindsight= factor(c("Error","Not Error","Error","Not Error","Error","Not Error","Error","Not Error")),Probability = c(hd.pred[1],hc.pred[1],fd.pred[1],fc.pred[1],hd.pred.i[1],hc.pred.i[1],fd.pred.i[1],fc.pred.i[1]),Feedback = factor(c("Affirming","Affirming","Disconfirming","Disconfirming","Affirming","Affirming","Disconfirming","Disconfirming")),lim.low = c(max(0,hd.pred[2]),max(0,hc.pred[2]),max(0,fd.pred[2]),max(0,fc.pred[2]),max(0,hd.pred.i[2]),max(0,hc.pred.i[2]),max(0,fd.pred.i[2]),max(0,fc.pred.i[2])),lim.high=c(min(1,hd.pred[3]),min(1,hc.pred[3]),min(1,fd.pred[3]),min(1,fc.pred[3]),min(1,hd.pred.i[3]),min(1,hc.pred.i[3]),min(1,fd.pred.i[3]),min(1,fc.pred.i[3])))
# Define the top and bottom of the error bars
limits <- aes(ymax = lim.high, ymin = lim.low)
p <- ggplot(df, aes(colour=Feedback, y=Probability, x=Hindsight),opts(panel.grid.major = theme_bw() ,panel.grid.minor = theme_bw(),panel.background = theme_bw(),axis.ticks = theme_blank())) 
p + geom_crossbar(limits, width=0.1,position="dodge",size=0.3,linetype=1,alpha=1,fatten=5)+scale_y_continuous(limits = c(0,1))+theme_bw()+ylab("Proportion Attributed to Error")+xlab("Bayes' Rule")+scale_colour_manual(values = c("darkred","darkblue"))+facet_grid(.~Incentive)+opts(legend.position="bottom")
  dev.off()

share.should<-glmer(share~actual.error*error*incentive+(1|subjectid),data=was2,family=binomial(link="logit"))
@

\begin{figure}[h]
    \centering
\scalebox{1.4}{\includegraphics{was2cons}}
\caption[Experiment Two Trials Attributed to Error Compared to Bayes' Rule]{Proportion of trials attributed to error depending on whether Bayes' Rule predicted error attribution and whether the feedback was affirming or disconfirming.}
\end{figure}

As in Experiment One, error attributions for participants in the control condition were consistent with their prior beliefs.  For affirming feedback, they correctly attributed \Sexpr{prettyNum(sum(was2$error[was2$falsification==0 & was2$p.fit<0.2 & was2$incentive==0]))} of \Sexpr{prettyNum(length(was2$error[was2$falsification==0 & was2$p.fit<0.2 & was2$incentive==0]))} trials to error and incorrectly attributed \Sexpr{prettyNum(sum(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==0]))} of \Sexpr{prettyNum(length(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==0]))} trials to error, $\chi^{2}(1) = \Sexpr{prettyNum(chi2a)}$, $p=$ \Sexpr{prettyNum(ptrunc(pchisq(chi2a,1,lower.tail=FALSE)))}, $\phi=$ \Sexpr{prettyNum(phia)}.\footnote{A hierarchical model could not be used for the control condition.  Only one participant both made an error attribution and should not have made an error attribution.  Thus, only one subject-level intercept could be fit, as all other participants had zero probability of judging error.  To deal with this, we pool all of the data together to get an approximate answer.}  For disconfirming feedback, they correctly attributed \Sexpr{prettyNum(sum(was2$error[was2$falsification==1 & was2$p.fit>0.8 & was2$incentive==0]))} of \Sexpr{prettyNum(length(was2$error[was2$falsification==1 & was2$p.fit>0.8 & was2$incentive==0]))} trials to error and incorrectly attributed \Sexpr{prettyNum(sum(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==0]))} of \Sexpr{prettyNum(length(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==0]))} trials to error, $\chi^{2}(1) = \Sexpr{prettyNum(chi2f)}$, $p<0.05$, $\phi = \Sexpr{prettyNum(phif)}$.  The overall correlation between their error attributions and the consistency criterion was $\phi=$ \Sexpr{prettyNum(phi.ov)}, $\chi^{2}(1)=$ \Sexpr{prettyNum(chi.ov)}, $p=$ \Sexpr{prettyNum(ptrunc(pchisq(chi.ov,1,lower.tail=FALSE)))}.

Participants in the incentive condition exhibited similar consistency.  For affirming feedback, they correctly attributed \Sexpr{prettyNum(sum(was2$error[was2$falsification==0 & was2$p.fit<0.2 & was2$incentive==1]))} of \Sexpr{prettyNum(length(was2$error[was2$falsification==0 & was2$p.fit<0.2 & was2$incentive==1]))} trials to error and incorrectly attributed \Sexpr{prettyNum(sum(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==1]))} of \Sexpr{prettyNum(length(was2$error[was2$falsification==0 & was2$p.fit>0.2 & was2$incentive==1]))} trials to error, $\chi^{2}(1)=$ \Sexpr{prettyNum(affchip.i)}, $p=$ \Sexpr{prettyNum(ptrunc(affphi.p.i))}, $\phi = \Sexpr{prettyNum(affphip.i)}$.  For disconfirming feedback, they correctly attributed \Sexpr{prettyNum(sum(was2$error[was2$falsification==1 & was2$p.fit>0.8 & was2$incentive==1]))} of \Sexpr{prettyNum(length(was2$error[was2$falsification==1 & was2$p.fit>0.8 & was2$incentive==1]))} trials to error and incorrectly attributed \Sexpr{prettyNum(sum(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==1]))} of \Sexpr{prettyNum(length(was2$error[was2$falsification==1 & was2$p.fit<0.8 & was2$incentive==1]))} trials to error, $\chi^{2}(1) = \Sexpr{prettyNum(discchip.i)}$, $p<0.05$, $\phi = \Sexpr{prettyNum(discphip.i)}$.  The overall correlation between their error attributions and the consistency criterion was $\phi = \Sexpr{prettyNum(phi.ov.i)}$, $\chi^{2}(1) = \Sexpr{prettyNum(chi.ov.i)}$, $p=\Sexpr{prettyNum(ptrunc(pchisq(chi.ov.i,1,lower.tail=FALSE)))}$.\footnote{We evaluate coherence with the same methods as in Experiment One.  As in Experiment One, we assign equal cost to both kinds of classification mistake, so each model predicts error if the posterior probability of error is greater than 50\%, and no error otherwise.
  
  In the control condition, participants again behaved in a coherent fashion.  As the Bayesian model predicts, when receiving falsification they attributed the feedback to error more when they strongly expected the triple to fit the actual rule, and when receiving affirmation they attributed the feedback to error less when they did not expect the triple to fit the actual rule.  

  However, those in the incentive condition did not make error judgments that were sensitive to their ex ante confidence.  For the incentive condition, the MLE logistic regression predicts that participants will never judge a trial as error, regardless of the prior probability and feedback.  

  Overall, pooling all the data together, the Bayesian ($LL = -82$) and quasi-Bayesian ($LL = -72$) models performed nearly as well as the kernel density estimate (KDE; $LL = -63$) and the MLE logistic regression ($LL = -66$) for the control condition.  However, the Bayesian ($LL = -301$) and quasi-Bayesian ($LL = -304$) models performed much worse than the MLE logistic regression ($LL = -150$) and the kernel density estimate ($LL = -131$) for the incentive condition.  For the incentive condition, the KDE and MLE models predicted that participants would never judge affirming feedback to be error, and likewise for falsifying feedback.  The parameters of the four quasi-Bayesian models were $\alpha_{f\text{-control}}=0.27$ (95\% CI [0.13, 0.41]), $\alpha_{a\text{-control}}=0.04$ (95\% CI [0.02, 0.07]), $\alpha_{f\text{-incentive}}=0.17$ (95\% CI [0.06, 0.28]), and $\alpha_{a\text{-incentive}}=0.04$ (95\% CI [0.00, 0.08]) for falsification-control, affirmation-control, falsification-incentive, and affirmation-incentive, respectively.}
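
For reference, a sketch of how the overall $\phi$ coefficients reported above relate to their test statistics, assuming the usual conversion for a one-degree-of-freedom test (the symbol $N$, the number of trials entering the test, is notation introduced here):
\[
\phi = \sqrt{\frac{\chi^{2}(1)}{N}} .
\]
Because $N$ appears in the denominator, a given $\chi^{2}$ value corresponds to a smaller $\phi$ as the number of trials grows.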

\subsubsection{Accuracy}

As in Experiment One, participants in the control condition were unable to identify when actual errors occurred.  They correctly identified 26\% of actual errors and incorrectly identified 20\% of non-errors as error, $\chi^{2}(1)=$ \Sexpr{prettyNum(mean(chip))}, $p=$ \Sexpr{prettyNum(ptrunc(mean(phi.p)))}, $\phi=$ \Sexpr{prettyNum(mean(phip))}.  For the incentive condition, participants were also unable to identify actual error.  They correctly identified 21\% of actual errors and incorrectly identified 21\% of non-errors as error, $\chi^{2}(1)=$ \Sexpr{prettyNum(mean(chip.i))}, $p=$ \Sexpr{prettyNum(ptrunc(mean(phi.p.i)))}, $\phi=$ \Sexpr{prettyNum(mean(phip.i))}.

\subsubsection{Data Sharing}

In contrast to Experiment One, participants in the control condition shared a smaller proportion of trials when the feedback was disconfirming (\Sexpr{prettyNum(100*mean(glmershf))}\%, $SE=$ \Sexpr{prettyNum(100*glmershf.se)}\%) than when it was affirming (\Sexpr{prettyNum(100*mean(glmersha))}\%, $SE=$ \Sexpr{prettyNum(100*glmersha.se)}\%), $t(171)$ = 1.96, $p=$ 0.05, $d=$ 0.15.  Similarly, they shared a smaller proportion of trials when they judged the feedback to be an error (\Sexpr{prettyNum(100*mean(glmersherr))}\%, $SE=$ \Sexpr{prettyNum(100*glmersherr.se)}\%) than when they judged it to be accurate (\Sexpr{prettyNum(100*mean(glmershnerr))}\%, $SE=$ \Sexpr{prettyNum(100*glmershnerr.se)}\%), $t(171)=$ 1.98, $p<$ 0.05, $d=$ 0.15.  Participants in the incentive condition also shared a smaller proportion of trials when the feedback was disconfirming (\Sexpr{prettyNum(100*mean(glmershf.i))}\%, $SE=$ \Sexpr{prettyNum(100*glmershf.se.i)}\%), than when it was affirming (\Sexpr{prettyNum(100*mean(glmersha.i))}\%, $SE=$ \Sexpr{prettyNum(100*glmersha.se.i)}\%), $t(309)=$ 2.95, $p<$ 0.05, $d=$ 0.17.  They also shared a smaller proportion of trials when they judged the feedback to be an error (\Sexpr{prettyNum(100*mean(glmersherr.i))}\%, $SE=$ \Sexpr{prettyNum(100*glmersherr.se.i)}\%) than when they judged the feedback to be accurate (\Sexpr{prettyNum(100*mean(glmershnerr.i))}\%, $SE=$ \Sexpr{prettyNum(100*glmershnerr.se.i)}\%), $t(309)=$ 3.94, $p<$ 0.05, $d=$ 0.22.  

In multiple regression, the only significant main effect was that of error attribution ($t(476)=$ \Sexpr{prettyNum(abs(fixef(share.should)[3]/sqrt(diag(vcov(share.should)))[3]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(share.should)[3]/sqrt(diag(vcov(share.should)))[3]),476)))}).  There was also a marginally significant interaction between actual error and incentive condition, such that those in the incentive condition were more likely to share actual errors than those in the control condition ($t(476)=$ \Sexpr{prettyNum(abs(fixef(share.should)[6]/sqrt(diag(vcov(share.should)))[6]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(share.should)[6]/sqrt(diag(vcov(share.should)))[6]),476)))}).  There were no other main effects, two-way, or three-way interactions between error attribution, actual error, and incentive condition.

\subsection{Discussion}

Experiment Two again found that participants more often attribute error to disconfirming feedback when given no incentive beyond their intrinsic motivation to solve the problem.  However, participants who were offered a large incentive for getting the rule attributed error to affirming and disconfirming feedback at equal rates.  Although we had expected the incentive for getting the rule to increase motivated reasoning, it actually reduced the tendency for participants to attribute disconfirming feedback to error.  It did not, however, lead to error attributions that were either more accurate or more consistent with prior expectations.  Participants in the control condition met the consistency criterion on \Sexpr{bayes.control} of \Sexpr{total.control} trials (\Sexpr{prettyNum(100*bayes.control/total.control)}\%), which was a higher rate than those in the incentive condition (\Sexpr{bayes.incentive} of \Sexpr{total.incentive} trials, \Sexpr{prettyNum(100*bayes.incentive/total.incentive)}\%).  One possible explanation is that the incentive helped participants maintain a more balanced perspective on the likelihood of error after receiving feedback; however, in spite of their motivation, they lacked the understanding (e.g., of Bayes' Rule) needed to respond consistently. An alternative explanation is that participants in the incentive condition rushed through the prior probability and error attribution questions in order to complete more trials, thereby creating more chances to propose triples and get feedback.  This strategy would reduce consistency and make attributions of error more equal across feedback types, and is consistent with the finding that participants in the incentive condition completed twice as many trials in the same time period as those in the control condition.
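
The consistency rates above can be written compactly.  As a sketch, assuming the thresholds used to score the consistency criterion (an error attribution is warranted for disconfirming feedback when the prior probability of fit exceeds 0.8, and for affirming feedback when it falls below 0.2), and with $A$ and $n$ introduced here to denote the number of error attributions and the number of trials in each category:
\[
\text{consistency rate} = \frac{A_{\text{warranted}} + \left(n_{\text{unwarranted}} - A_{\text{unwarranted}}\right)}{n_{\text{warranted}} + n_{\text{unwarranted}}} .
\]
That is, a trial counts as consistent either when an error attribution was warranted and made, or when it was unwarranted and withheld.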

For both the control and the incentive groups, participants shared disconfirming feedback less frequently than affirming feedback.  They also shared feedback that they attributed to error less frequently than feedback that they saw as accurate.  Those error attributions were loosely justified by internal consistency, but not by accuracy.  Extrapolating to scientific contexts, researchers may have defensible reasons to omit data from publication based on their expectations, but this consistency may not prevent harm to those who must use the data.  Before reaching that conclusion, we address one possible artifact in Experiment Two's procedure: the sharing decision was placed immediately after the error attribution task, perhaps suggesting that the two should be related.  Experiment Three remedies this possible confound by eliciting data-sharing decisions and error attributions both during each trial and at the end of the task, which also allows participants to reflect on all the data before making their final error attributions and data-sharing decisions.

Finally, Experiment Two's incentive scheme sought to motivate participants to believe they knew the rule.  However, the value of data is usually determined not by the person who collects the data, but by others, such as reviewers (for journals) or regulatory bodies (for drug approval).  These people, who are external to the data collection process, determine the reward to the researcher based on their prior beliefs and their evaluation of the data shared with them.  To simulate this incentive system more closely, Experiment Three uses the natural expectations that participants have about how to convince another person.  We expect that an incentive to convince another person should increase the preference for discounting disconfirming feedback.

\section{Experiment Three}

Experiment Three replicates Experiment Two with several modifications.  Most importantly, a new condition provides an incentive for participants to convince another person that their proposed Final Answer is correct, with data-sharing as the sole mode of communication between them.  To do this, we embed the Wason task in a teacher--learner game, a type of principal--agent game \cite{fudenberg1991game,shaftoepistemic}.  In this task, the participant collecting the data (the teacher) shares data with another person (the learner) who has to guess the rule based on the data that the teacher decides to share.  

The teacher is in one of two incentive conditions.  The \emph{compatible} incentive condition rewards both the teacher and learner if the learner guesses the rule.  In the \emph{perverse} incentive condition, the learner's rewards remain the same, but the teacher receives money if the learner accepts the teacher's Final Answer.  Thus, the perverse incentive allows the teacher to distort the data supplied to the learner, potentially increasing her own payoff while reducing the learner's reward.  In this scenario, the teacher knows the entire game structure, but the learner does not.  Specifically, the learner is not told that the teacher does not have to share all the trials that were conducted, and the teacher is told that the learner only knows about the shared trials.

Experiment Three also deals with two methodological issues brought up in Experiment Two.  One is that participants in the incentive condition attributed error to affirmation and disconfirmation equally, but were slightly less consistent in their error attributions than participants in the control condition.  This may have reflected their rushing through the task to complete more trials.  To reduce this threat, we use a penalty for making incorrect prior probability and error attributions.  Any payoff to the participant is reduced in proportion to their inaccuracy on these two measures.  This penalty prevents them from performing one element of the task well (collecting many trials) at the cost of the other elements (rushing through error attributions).  The second was the possibility that participants assumed that the data sharing and error attribution judgments should be related because they occurred sequentially on each trial.  This could create a false correlation between the two measures based on the participant's belief that the experimenter put the two questions close to each other for a reason.  To deal with this, we also elicit data sharing decisions and error attributions at the end of the task, using a fixed-response format rather than the open-ended format used in Experiment One.

\subsection{Method}

<<exp3,results=hide,echo=false,fig=false>>=
a<-(356+cumsum(rep(c(4,2,1,1,1,1,1,1),37)))
b<-c(1,6,8,9,18,21:24,27,31,34:38,41,45,48:52,55,318,320,322,323,326,328:334,337,339:345,348,350:356)
c<-append(b,a)
c<-append(c,c(804,977:1062,1065,1066,1098))
was3.t<-read.csv("Wason_Cumulative_Info.csv",na.strings="",skip=2,header=FALSE)
was3<-was3.t[1:length(was3.t[,1]),c]
  
g1<-c("subjectid","IP","start","end","open1","trial1.1","trial1.2","trial1.3","trial1.p.fit","trial1.feedback","open2","trial2.false","trial2.1","trial2.2","trial2.3","trial2.p.fit","trial2.feedback","open3","trial3.false","trial3.1","trial3.2","trial3.3","trial3.p.fit","trial3.feedback","bonus.self.perverse","bonus.other.perverse","bonus.self.compatible","bonus.other.compatible","open4","trial4.feedback","trial4.false","trial4.share","trial4.1","trial4.2","trial4.3","trial4.p.fit","open5","trial5.feedback","trial5.false","trial5.share","trial5.1","trial5.2","trial5.3","trial5.p.fit","open6","trial6.feedback","trial6.false","trial6.share","trial6.1","trial6.2","trial6.3","trial6.p.fit","open7","trial7.feedback","trial7.false","trial7.share","trial7.1","trial7.2","trial7.3","trial7.p.fit","open8","trial8.feedback","trial8.false","trial8.share","trial8.1","trial8.2","trial8.3","trial8.p.fit","open9","trial9.feedback","trial9.false","trial9.share","trial9.1","trial9.2","trial9.3","trial9.p.fit")
g1<-c(g1,c("open10","trial10.feedback","trial10.false","trial10.share","trial10.1","trial10.2","trial10.3","trial10.p.fit","open11","trial11.feedback","trial11.false","trial11.share","trial11.1","trial11.2","trial11.3","trial11.p.fit","open12","trial12.feedback","trial12.false","trial12.share","trial12.1","trial12.2","trial12.3","trial12.p.fit","open13","trial13.feedback","trial13.false","trial13.share","trial13.1","trial13.2","trial13.3","trial13.p.fit","open14","trial14.feedback","trial14.false","trial14.share","trial14.1","trial14.2","trial14.3","trial14.p.fit","open15","trial15.feedback","trial15.false","trial15.share","trial15.1","trial15.2","trial15.3","trial15.p.fit","open16","trial16.feedback","trial16.false","trial16.share","trial16.1","trial16.2","trial16.3","trial16.p.fit","open17","trial17.feedback","trial17.false","trial17.share","trial17.1","trial17.2","trial17.3","trial17.p.fit","open18","trial18.feedback","trial18.false","trial18.share","trial18.1","trial18.2","trial18.3","trial18.p.fit","open19","trial19.feedback","trial19.false","trial19.share","trial19.1","trial19.2","trial19.3","trial19.p.fit","open20","trial20.feedback","trial20.false","trial20.share","trial20.1","trial20.2","trial20.3","trial20.p.fit","open21","trial21.feedback","trial21.false","trial21.share","trial21.1","trial21.2","trial21.3","trial21.p.fit","open22","trial22.feedback","trial22.false","trial22.share","trial22.1","trial22.2","trial22.3","trial22.p.fit","open23","trial23.feedback","trial23.false","trial23.share","trial23.1","trial23.2","trial23.3","trial23.p.fit","open24","trial24.feedback","trial24.false","trial24.share","trial24.1","trial24.2","trial24.3","trial24.p.fit"))
g1<-c(g1,c("open25","trial25.feedback","trial25.false","trial25.share","trial25.1","trial25.2","trial25.3","trial25.p.fit","open26","trial26.feedback","trial26.false","trial26.share","trial26.1","trial26.2","trial26.3","trial26.p.fit","open27","trial27.feedback","trial27.false","trial27.share","trial27.1","trial27.2","trial27.3","trial27.p.fit","open28","trial28.feedback","trial28.false","trial28.share","trial28.1","trial28.2","trial28.3","trial28.p.fit","open29","trial29.feedback","trial29.false","trial29.share","trial29.1","trial29.2","trial29.3","trial29.p.fit","open30","trial30.feedback","trial30.false","trial30.share","trial30.1","trial30.2","trial30.3","trial30.p.fit","open31","trial31.feedback","trial31.false","trial31.share","trial31.1","trial31.2","trial31.3","trial31.p.fit","open32","trial32.feedback","trial32.false","trial32.share","trial32.1","trial32.2","trial32.3","trial32.p.fit","open33","trial33.feedback","trial33.false","trial33.share","trial33.1","trial33.2","trial33.3","trial33.p.fit","open34","trial34.feedback","trial34.false","trial34.share","trial34.1","trial34.2","trial34.3","trial34.p.fit","open35","trial35.feedback","trial35.false","trial35.share","trial35.1","trial35.2","trial35.3","trial35.p.fit","open36","trial36.feedback","trial36.false","trial36.share","trial36.1","trial36.2","trial36.3","trial36.p.fit"))
g1<-c(g1,c("open37","trial37.feedback","trial37.false","trial37.share","trial37.1","trial37.2","trial37.3","trial37.p.fit","open38","trial38.feedback","trial38.false","trial38.share","trial38.1","trial38.2","trial38.3","trial38.p.fit","open39","trial39.feedback","trial39.false","trial39.share","trial39.1","trial39.2","trial39.3","trial39.p.fit","open40","trial40.feedback","trial40.false","trial40.share","trial40.1","trial40.2","trial40.3","trial40.p.fit","open41","trial41.feedback","trial41.false","trial41.share","trial41.1","trial41.2","trial41.3","trial41.p.fit","open42","trial42.feedback","trial42.false","trial42.share","trial42.1","trial42.2","trial42.3","trial42.p.fit","open43","trial43.feedback","trial43.false","trial43.share","trial43.1","trial43.2","trial43.3","trial43.p.fit","final.answer","trial1.false.final","trial2.false.final","trial3.false.final","trial4.false.final","trial5.false.final","trial6.false.final","trial7.false.final","trial8.false.final","trial9.false.final","trial10.false.final"))
g1<-c(g1,c("trial11.false.final","trial12.false.final","trial13.false.final","trial14.false.final","trial15.false.final","trial16.false.final","trial17.false.final","trial18.false.final","trial19.false.final","trial20.false.final","trial21.false.final","trial22.false.final","trial23.false.final","trial24.false.final","trial25.false.final","trial26.false.final","trial27.false.final","trial28.false.final","trial29.false.final","trial30.false.final","trial31.false.final","trial32.false.final","trial33.false.final","trial34.false.final","trial35.false.final","trial36.false.final","trial37.false.final","trial38.false.final","trial39.false.final","trial40.false.final","trial41.false.final","trial42.false.final","trial43.false.final","trial1.share.final","trial2.share.final","trial3.share.final","trial4.share.final","trial5.share.final","trial6.share.final","trial7.share.final","trial8.share.final","trial9.share.final","trial10.share.final","trial11.share.final","trial12.share.final","trial13.share.final","trial14.share.final","trial15.share.final","trial16.share.final","trial17.share.final","trial18.share.final","trial19.share.final","trial20.share.final","trial21.share.final","trial22.share.final","trial23.share.final","trial24.share.final","trial25.share.final","trial26.share.final","trial27.share.final","trial28.share.final","trial29.share.final","trial30.share.final","trial31.share.final","trial32.share.final","trial33.share.final","trial34.share.final","trial35.share.final","trial36.share.final","trial37.share.final","trial38.share.final","trial39.share.final","trial40.share.final","trial41.share.final","trial42.share.final","trial43.share.final","gender","age","incentive"))

colnames(was3)<-g1

b<-matrix(NA,nrow=1,ncol=13)
b<-data.frame(b)
colnames(b)<-c("name","trial","open","feedback","error","share","num1","num2","num3","p.fit","false.final","share.final","incentive")
was3d<-b
was3d<-was3d[-1,]

write.table(was3[,c("final.answer","subjectid")],"was3d.answer.csv",sep=",")
scores<-read.csv("was3d.answer-1.csv",na.strings="")
scores<-scores[,c("subjectid","ascending","consecutive","evens","lower.two","Upper.100","score")]

for(i in 1:length(levels(as.factor(was3$subjectid)))){
name<-as.factor(rep(was3$subjectid[i],43))
final.answer<-as.factor(rep(was3$final.answer[i],43))
trial<-seq(1,43)
incentive<-rep(was3$incentive[i],43)

score<-rep(scores$score[scores$subjectid==was3$subjectid[i]],43)
ascending<-rep(scores$ascending[scores$subjectid==was3$subjectid[i]],43)
consecutive<-rep(scores$consecutive[scores$subjectid==was3$subjectid[i]],43)
even<-rep(scores$evens[scores$subjectid==was3$subjectid[i]],43)
lower<-rep(scores$lower.two[scores$subjectid==was3$subjectid[i]],43)
upper<-rep(scores$Upper.100[scores$subjectid==was3$subjectid[i]],43)

open<-c(was3$open1[i],was3$open2[i],was3$open3[i],was3$open4[i],was3$open5[i],was3$open6[i],was3$open7[i],was3$open8[i],was3$open9[i],was3$open10[i],was3$open11[i],was3$open12[i],was3$open13[i],was3$open14[i],was3$open15[i],was3$open16[i],was3$open17[i],was3$open18[i],was3$open19[i],was3$open20[i],was3$open21[i],was3$open22[i],was3$open23[i],was3$open24[i],was3$open25[i],was3$open26[i],was3$open27[i],was3$open28[i],was3$open29[i],was3$open30[i],was3$open31[i],was3$open32[i],was3$open33[i],was3$open34[i],was3$open35[i],was3$open36[i],was3$open37[i],was3$open38[i],was3$open39[i],was3$open40[i],was3$open41[i],was3$open42[i],was3$open43[i])

q<-c("FIT","DNF")

feedback<-c(factor(was3$trial1.feedback[i],levels=q),factor(was3$trial2.feedback[i],levels=q),factor(was3$trial3.feedback[i],levels=q),factor(was3$trial4.feedback[i],levels=q),factor(was3$trial5.feedback[i],levels=q),factor(was3$trial6.feedback[i],levels=q),factor(was3$trial7.feedback[i],levels=q),factor(was3$trial8.feedback[i],levels=q),factor(was3$trial9.feedback[i],levels=q),factor(was3$trial10.feedback[i],levels=q),factor(was3$trial11.feedback[i],levels=q),factor(was3$trial12.feedback[i],levels=q),factor(was3$trial13.feedback[i],levels=q),factor(was3$trial14.feedback[i],levels=q),factor(was3$trial15.feedback[i],levels=q),factor(was3$trial16.feedback[i],levels=q),factor(was3$trial17.feedback[i],levels=q),factor(was3$trial18.feedback[i],levels=q),factor(was3$trial19.feedback[i],levels=q),factor(was3$trial20.feedback[i],levels=q),factor(was3$trial21.feedback[i],levels=q),factor(was3$trial22.feedback[i],levels=q),factor(was3$trial23.feedback[i],levels=q),factor(was3$trial24.feedback[i],levels=q),factor(was3$trial25.feedback[i],levels=q),factor(was3$trial26.feedback[i],levels=q),factor(was3$trial27.feedback[i],levels=q),factor(was3$trial28.feedback[i],levels=q),factor(was3$trial29.feedback[i],levels=q),factor(was3$trial30.feedback[i],levels=q),factor(was3$trial31.feedback[i],levels=q),factor(was3$trial32.feedback[i],levels=q),factor(was3$trial33.feedback[i],levels=q),factor(was3$trial34.feedback[i],levels=q),factor(was3$trial35.feedback[i],levels=q),factor(was3$trial36.feedback[i],levels=q),factor(was3$trial37.feedback[i],levels=q),factor(was3$trial38.feedback[i],levels=q),factor(was3$trial39.feedback[i],levels=q),factor(was3$trial40.feedback[i],levels=q),factor(was3$trial41.feedback[i],levels=q),factor(was3$trial42.feedback[i],levels=q),factor(was3$trial43.feedback[i],levels=q))

error<-c(NA,was3$trial2.false[i],was3$trial3.false[i],was3$trial4.false[i],was3$trial5.false[i],was3$trial6.false[i],was3$trial7.false[i],was3$trial8.false[i],was3$trial9.false[i],was3$trial10.false[i],was3$trial11.false[i],was3$trial12.false[i],was3$trial13.false[i],was3$trial14.false[i],was3$trial15.false[i],was3$trial16.false[i],was3$trial17.false[i],was3$trial18.false[i],was3$trial19.false[i],was3$trial20.false[i],was3$trial21.false[i],was3$trial22.false[i],was3$trial23.false[i],was3$trial24.false[i],was3$trial25.false[i],was3$trial26.false[i],was3$trial27.false[i],was3$trial28.false[i],was3$trial29.false[i],was3$trial30.false[i],was3$trial31.false[i],was3$trial32.false[i],was3$trial33.false[i],was3$trial34.false[i],was3$trial35.false[i],was3$trial36.false[i],was3$trial37.false[i],was3$trial38.false[i],was3$trial39.false[i],was3$trial40.false[i],was3$trial41.false[i],was3$trial42.false[i],was3$trial43.false[i])

share<-c(NA,was3$trial2.share[i],was3$trial3.share[i],was3$trial4.share[i],was3$trial5.share[i],was3$trial6.share[i],was3$trial7.share[i],was3$trial8.share[i],was3$trial9.share[i],was3$trial10.share[i],was3$trial11.share[i],was3$trial12.share[i],was3$trial13.share[i],was3$trial14.share[i],was3$trial15.share[i],was3$trial16.share[i],was3$trial17.share[i],was3$trial18.share[i],was3$trial19.share[i],was3$trial20.share[i],was3$trial21.share[i],was3$trial22.share[i],was3$trial23.share[i],was3$trial24.share[i],was3$trial25.share[i],was3$trial26.share[i],was3$trial27.share[i],was3$trial28.share[i],was3$trial29.share[i],was3$trial30.share[i],was3$trial31.share[i],was3$trial32.share[i],was3$trial33.share[i],was3$trial34.share[i],was3$trial35.share[i],was3$trial36.share[i],was3$trial37.share[i],was3$trial38.share[i],was3$trial39.share[i],was3$trial40.share[i],was3$trial41.share[i],was3$trial42.share[i],was3$trial43.share[i])

num1<-c(was3$trial1.1[i],was3$trial2.1[i],was3$trial3.1[i],was3$trial4.1[i],was3$trial5.1[i],was3$trial6.1[i],was3$trial7.1[i],was3$trial8.1[i],was3$trial9.1[i],was3$trial10.1[i],was3$trial11.1[i],was3$trial12.1[i],was3$trial13.1[i],was3$trial14.1[i],was3$trial15.1[i],was3$trial16.1[i],was3$trial17.1[i],was3$trial18.1[i],was3$trial19.1[i],was3$trial20.1[i],was3$trial21.1[i],was3$trial22.1[i],was3$trial23.1[i],was3$trial24.1[i],was3$trial25.1[i],was3$trial26.1[i],was3$trial27.1[i],was3$trial28.1[i],was3$trial29.1[i],was3$trial30.1[i],was3$trial31.1[i],was3$trial32.1[i],was3$trial33.1[i],was3$trial34.1[i],was3$trial35.1[i],was3$trial36.1[i],was3$trial37.1[i],was3$trial38.1[i],was3$trial39.1[i],was3$trial40.1[i],was3$trial41.1[i],was3$trial42.1[i],was3$trial43.1[i])

num2<-c(was3$trial1.2[i],was3$trial2.2[i],was3$trial3.2[i],was3$trial4.2[i],was3$trial5.2[i],was3$trial6.2[i],was3$trial7.2[i],was3$trial8.2[i],was3$trial9.2[i],was3$trial10.2[i],was3$trial11.2[i],was3$trial12.2[i],was3$trial13.2[i],was3$trial14.2[i],was3$trial15.2[i],was3$trial16.2[i],was3$trial17.2[i],was3$trial18.2[i],was3$trial19.2[i],was3$trial20.2[i],was3$trial21.2[i],was3$trial22.2[i],was3$trial23.2[i],was3$trial24.2[i],was3$trial25.2[i],was3$trial26.2[i],was3$trial27.2[i],was3$trial28.2[i],was3$trial29.2[i],was3$trial30.2[i],was3$trial31.2[i],was3$trial32.2[i],was3$trial33.2[i],was3$trial34.2[i],was3$trial35.2[i],was3$trial36.2[i],was3$trial37.2[i],was3$trial38.2[i],was3$trial39.2[i],was3$trial40.2[i],was3$trial41.2[i],was3$trial42.2[i],was3$trial43.2[i])

num3<-c(was3$trial1.3[i],was3$trial2.3[i],was3$trial3.3[i],was3$trial4.3[i],was3$trial5.3[i],was3$trial6.3[i],was3$trial7.3[i],was3$trial8.3[i],was3$trial9.3[i],was3$trial10.3[i],was3$trial11.3[i],was3$trial12.3[i],was3$trial13.3[i],was3$trial14.3[i],was3$trial15.3[i],was3$trial16.3[i],was3$trial17.3[i],was3$trial18.3[i],was3$trial19.3[i],was3$trial20.3[i],was3$trial21.3[i],was3$trial22.3[i],was3$trial23.3[i],was3$trial24.3[i],was3$trial25.3[i],was3$trial26.3[i],was3$trial27.3[i],was3$trial28.3[i],was3$trial29.3[i],was3$trial30.3[i],was3$trial31.3[i],was3$trial32.3[i],was3$trial33.3[i],was3$trial34.3[i],was3$trial35.3[i],was3$trial36.3[i],was3$trial37.3[i],was3$trial38.3[i],was3$trial39.3[i],was3$trial40.3[i],was3$trial41.3[i],was3$trial42.3[i],was3$trial43.3[i])

p.fit<-c(was3$trial1.p.fit[i],was3$trial2.p.fit[i],was3$trial3.p.fit[i],was3$trial4.p.fit[i],was3$trial5.p.fit[i],was3$trial6.p.fit[i],was3$trial7.p.fit[i],was3$trial8.p.fit[i],was3$trial9.p.fit[i],was3$trial10.p.fit[i],was3$trial11.p.fit[i],was3$trial12.p.fit[i],was3$trial13.p.fit[i],was3$trial14.p.fit[i],was3$trial15.p.fit[i],was3$trial16.p.fit[i],was3$trial17.p.fit[i],was3$trial18.p.fit[i],was3$trial19.p.fit[i],was3$trial20.p.fit[i],was3$trial21.p.fit[i],was3$trial22.p.fit[i],was3$trial23.p.fit[i],was3$trial24.p.fit[i],was3$trial25.p.fit[i],was3$trial26.p.fit[i],was3$trial27.p.fit[i],was3$trial28.p.fit[i],was3$trial29.p.fit[i],was3$trial30.p.fit[i],was3$trial31.p.fit[i],was3$trial32.p.fit[i],was3$trial33.p.fit[i],was3$trial34.p.fit[i],was3$trial35.p.fit[i],was3$trial36.p.fit[i],was3$trial37.p.fit[i],was3$trial38.p.fit[i],was3$trial39.p.fit[i],was3$trial40.p.fit[i],was3$trial41.p.fit[i],was3$trial42.p.fit[i],was3$trial43.p.fit[i])

false.final<-c(was3$trial1.false.final[i],was3$trial2.false.final[i],was3$trial3.false.final[i],was3$trial4.false.final[i],was3$trial5.false.final[i],was3$trial6.false.final[i],was3$trial7.false.final[i],was3$trial8.false.final[i],was3$trial9.false.final[i],was3$trial10.false.final[i],was3$trial11.false.final[i],was3$trial12.false.final[i],was3$trial13.false.final[i],was3$trial14.false.final[i],was3$trial15.false.final[i],was3$trial16.false.final[i],was3$trial17.false.final[i],was3$trial18.false.final[i],was3$trial19.false.final[i],was3$trial20.false.final[i],was3$trial21.false.final[i],was3$trial22.false.final[i],was3$trial23.false.final[i],was3$trial24.false.final[i],was3$trial25.false.final[i],was3$trial26.false.final[i],was3$trial27.false.final[i],was3$trial28.false.final[i],was3$trial29.false.final[i],was3$trial30.false.final[i],was3$trial31.false.final[i],was3$trial32.false.final[i],was3$trial33.false.final[i],was3$trial34.false.final[i],was3$trial35.false.final[i],was3$trial36.false.final[i],was3$trial37.false.final[i],was3$trial38.false.final[i],was3$trial39.false.final[i],was3$trial40.false.final[i],was3$trial41.false.final[i],was3$trial42.false.final[i],was3$trial43.false.final[i])

share.final<-c(was3$trial1.share.final[i],was3$trial2.share.final[i],was3$trial3.share.final[i],was3$trial4.share.final[i],was3$trial5.share.final[i],was3$trial6.share.final[i],was3$trial7.share.final[i],was3$trial8.share.final[i],was3$trial9.share.final[i],was3$trial10.share.final[i],was3$trial11.share.final[i],was3$trial12.share.final[i],was3$trial13.share.final[i],was3$trial14.share.final[i],was3$trial15.share.final[i],was3$trial16.share.final[i],was3$trial17.share.final[i],was3$trial18.share.final[i],was3$trial19.share.final[i],was3$trial20.share.final[i],was3$trial21.share.final[i],was3$trial22.share.final[i],was3$trial23.share.final[i],was3$trial24.share.final[i],was3$trial25.share.final[i],was3$trial26.share.final[i],was3$trial27.share.final[i],was3$trial28.share.final[i],was3$trial29.share.final[i],was3$trial30.share.final[i],was3$trial31.share.final[i],was3$trial32.share.final[i],was3$trial33.share.final[i],was3$trial34.share.final[i],was3$trial35.share.final[i],was3$trial36.share.final[i],was3$trial37.share.final[i],was3$trial38.share.final[i],was3$trial39.share.final[i],was3$trial40.share.final[i],was3$trial41.share.final[i],was3$trial42.share.final[i],was3$trial43.share.final[i])

false.final<-false.final%%2
share.final<-share.final%%2

a<-data.frame(name,trial,open,feedback,error,share,num1,num2,num3,p.fit,false.final,share.final,incentive,score,ascending,consecutive,even,lower,upper,final.answer,stringsAsFactors=FALSE)
was3d<-rbind(was3d,a)
}
 
TFTR<-ifelse(was3d$num1<was3d$num2 & was3d$num2<was3d$num3 & was3d$num1%%2==0 & (was3d$num2-was3d$num1)==2 & (was3d$num3-was3d$num2)==2 & was3d$num1>=2 & was3d$num3<=100,1,0)
was3d<-cbind(was3d,TFTR)
actual.error<-ifelse((was3d$TFTR==1 & was3d$feedback==2) | (was3d$TFTR==0 & was3d$feedback==1),1,0)
was3d<-cbind(was3d,actual.error)
##During task error attributions:1=True; 2=False##
##During task sharing: 1=Yes; 2=No##
##Final error attributions:1=False;2=True##
##Final sharing: 1=Share; 2=Don't share##
was3d$error.should<-ifelse(is.na(was3d$feedback)==TRUE,NA,0)
was3d$error.should[was3d$feedback==2 & was3d$p.fit>80]<-1
was3d$error.should[was3d$feedback==1 & was3d$p.fit<20]<-1
was3d$error.should[was3d$feedback==2 & was3d$p.fit==80]<-NA
was3d$error.should[was3d$feedback==1 & was3d$p.fit==20]<-NA

##error is still coded 1/2 at this point (it is shifted to 0/1 further down), so compare against 2 and 1 rather than 1 and 0##
was3d$cons.tp<-ifelse(was3d$error.should==1 & was3d$error==2,1,0)
was3d$cons.tn<-ifelse(was3d$error.should==0 & was3d$error==1,1,0)
was3d$cons.fp<-ifelse(was3d$error.should==0 & was3d$error==2,1,0)
was3d$cons.fn<-ifelse(was3d$error.should==1 & was3d$error==1,1,0)

was3d$subjectid<-as.numeric(was3d$name)
was3d$falsification<-was3d$feedback-1
was3d$incentivef<-as.factor(was3d$incentive)
was3d$incentive<-as.factor(was3d$incentive)
levels(was3d$incentive)<-c(0,1)
was3d$error<-was3d$error-1
was3d$share<-was3d$share%%2

score.comp<-c()
for(i in levels(as.factor(as.character(was3d$subjectid[was3d$incentive==0])))){
  score.comp[i]<-was3d$score[was3d$subjectid==i][1]}
score.perv<-c()
for(i in levels(as.factor(as.character(was3d$subjectid[was3d$incentive==1])))){
  score.perv[i]<-was3d$score[was3d$subjectid==i][1]}

was3d$trial.ascending<-ifelse(was3d$num1<was3d$num2 & was3d$num2<was3d$num3,1,0)
was3d$trial.consecutive<-ifelse(abs(was3d$num1-was3d$num2)==2 & abs(was3d$num2-was3d$num3)==2,1,0)
was3d$trial.even<-ifelse(was3d$num1%%2==0 & was3d$num2%%2==0 & was3d$num3%%2==0,1,0)
was3d$trial.lower<-ifelse(was3d$num1>=2 & was3d$num2>=2 & was3d$num3>=2,1,0)
was3d$trial.upper<-ifelse(was3d$num3<=100 & was3d$num2<=100 & was3d$num1<=100,1,0)
was3d$num.inconsistent<-ifelse(was3d$ascending==was3d$trial.ascending & was3d$consecutive==was3d$trial.consecutive & was3d$even==was3d$trial.even,0,1)
was3d$final.consistent<-ifelse(was3d$num.inconsistent==was3d$falsification,1,0)

final.consistent<-glmer(share.final~final.consistent+(1|subjectid),data=was3d[was3d$incentive==0,],family=binomial(link="logit"))
final.consistent.i<-glmer(share.final~final.consistent+(1|subjectid),data=was3d[was3d$incentive==1,],family=binomial(link="logit"))
final.consistent.t<-glmer(share.final~final.consistent*incentive+(1|subjectid),data=was3d,family=binomial(link="logit"))

unique.id<-was3.t[,c("V1075","V1")]
qqq<-levels(unique.id$V1)
for(i in 1:length(qqq)){
was3d$unique.id[was3d$name==qqq[i]]<-as.character(unique.id$V1075[unique.id$V1==qqq[i]])
}

                  
##Control: Nested Bootstrap##
chip<-c()
phip<-c()
phi.p<-c()
chip.final<-c()
phip.final<-c()
phi.p.final<-c()
glmeraa<-c()
glmerff<-c()
glmeraa.final<-c()
glmerff.final<-c()
glmershnerr<-c()
glmersherr<-c()
glmershnerr.final<-c()
glmersherr.final<-c()
glmersha<-c()
glmershf<-c()
tfals<-c()
tfals.final<-c()
tsh<-c()
tsherr<-c()
tsherr.final<-c()
icc.shatt.final<-c()
icc.shatt<-c()
icc.shfee<-c()
glmershatt.final.chi<-c()
glmershatt.final.chip<-c()
glmershfee.chi<-c()
glmershfee.chip<-c()
glmershatt.chi<-c()
glmershatt.chip<-c()
qf<-was3d
##For each observation in the dataset make a zero##
qf$new.id<-c(rep(0,length(qf$subjectid)))
for(i in 1:n3){
##obtain the id numbers of subjects we want##
a<-as.numeric(levels(as.factor(as.character(qf$subjectid[qf$incentive==0]))))
##Sort a sample from these subjects##
a<-sort(sample(a,size=length(a),replace=TRUE))
##Make some new IDs for the bootstrap sample##
new.id<-seq(length(a))
##Make a data frame holding the old and new ID numbers##
a<-cbind(a,new.id)
a<-data.frame(a)
q1<-data.frame()
####
##For each ID number##
for(j in 1:length(a[,2])){
##The participant with subjectid equal to the jth sampled ID number gets a new ID number equal to j##
  qf$new.id[qf$subjectid==a$a[j]]<-j
  ##Add this new participant to the bootstrapped sample##
q1<-rbind(q1,qf[qf$subjectid==a$a[j],])
} 
glmerfals<-glmer(error~falsification+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmeraa[i]<-invlogit(fixef(glmerfals)[1])
glmerff[i]<-invlogit(fixef(glmerfals)[1]+fixef(glmerfals)[2])
tfals[i]<-fixef(glmerfals)[2]/sqrt(diag(vcov(glmerfals))[2])

glmerfals.final<-glmer(false.final~falsification+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmeraa.final[i]<-invlogit(fixef(glmerfals.final)[1])
glmerff.final[i]<-invlogit(fixef(glmerfals.final)[1]+fixef(glmerfals.final)[2])
tfals.final[i]<-fixef(glmerfals.final)[2]/sqrt(diag(vcov(glmerfals.final))[2])

glmerphi<-glmer(actual.error~error+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmerphi.null<-glmer(actual.error~(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
nova<-anova(glmerphi.null,glmerphi)
chip[i]<-nova$Chisq[2]
phip[i]<-sqrt(nova$Chisq[2]/length(na.omit(q1[q1$incentive==0,]$error)))
phi.p[i]<-nova[2,7]

glmerphi<-glmer(actual.error~false.final+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmerphi.null<-glmer(actual.error~(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
nova<-anova(glmerphi.null,glmerphi)
chip.final[i]<-nova$Chisq[2]
phip.final[i]<-sqrt(nova$Chisq[2]/length(na.omit(q1[q1$incentive==0,]$false.final)))
phi.p.final[i]<-nova[2,7]

glmershfee<-glmer(share~falsification+(falsification|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmershfee.int<-glmer(share~falsification+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmersha[i]<-invlogit(fixef(glmershfee.int)[1])
glmershf[i]<-invlogit(fixef(glmershfee.int)[1]+fixef(glmershfee.int)[2])
glmershfee.anova<-anova(glmershfee,glmershfee.int)
glmershfee.chi[i]<-glmershfee.anova$Chisq[2]
glmershfee.chip[i]<-glmershfee.anova[2,7]
lmershfee<-lmer(share~falsification+(falsification|new.id),data=q1[q1$incentive==0,])
icc.shfee[i]<-(as.numeric(summary(lmershfee)@REmat[,4][2]))/(as.numeric(summary(lmershfee)@REmat[,4][1])+as.numeric(summary(lmershfee)@REmat[,4][2])+as.numeric(summary(lmershfee)@REmat[,4][3]))
tsh[i]<-abs(fixef(glmershfee)[2]/sqrt(diag(vcov(glmershfee))[2]))

glmershatt<-glmer(share~error+(error|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmershatt.int<-glmer(share~error+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmershnerr[i]<-invlogit(fixef(glmershatt.int)[1])
glmersherr[i]<-invlogit(fixef(glmershatt.int)[1]+fixef(glmershatt.int)[2])
glmershatt.anova<-anova(glmershatt,glmershatt.int)
glmershatt.chi[i]<-glmershatt.anova$Chisq[2]
glmershatt.chip[i]<-glmershatt.anova[2,7]
lmershatt<-lmer(share~error+(error|new.id),data=q1[q1$incentive==0,])
icc.shatt[i]<-(as.numeric(summary(lmershatt)@REmat[,4][2]))/(as.numeric(summary(lmershatt)@REmat[,4][1])+as.numeric(summary(lmershatt)@REmat[,4][2])+as.numeric(summary(lmershatt)@REmat[,4][3]))
tsherr[i]<-abs(fixef(glmershatt)[2]/sqrt(diag(vcov(glmershatt))[2]))

glmershatt.final<-glmer(share.final~false.final+(false.final|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmershatt.final.int<-glmer(share.final~false.final+(1|new.id),data=q1[q1$incentive==0,],family=binomial(link="logit"))
glmershnerr.final[i]<-invlogit(fixef(glmershatt.final.int)[1])
glmersherr.final[i]<-invlogit(fixef(glmershatt.final.int)[1]+fixef(glmershatt.final.int)[2])
glmershatt.final.anova<-anova(glmershatt.final,glmershatt.final.int)
glmershatt.final.chi[i]<-glmershatt.final.anova$Chisq[2]
glmershatt.final.chip[i]<-glmershatt.final.anova[2,7]
lmershatt.final<-lmer(share.final~false.final+(false.final|new.id),data=q1[q1$incentive==0,])
icc.shatt.final[i]<-(as.numeric(summary(lmershatt.final)@REmat[,4][2]))/(as.numeric(summary(lmershatt.final)@REmat[,4][1])+as.numeric(summary(lmershatt.final)@REmat[,4][2])+as.numeric(summary(lmershatt.final)@REmat[,4][3]))
tsherr.final[i]<-abs(fixef(glmershatt.final)[2]/sqrt(diag(vcov(glmershatt.final))[2]))
}
glmeraa.se<-sd(glmeraa)
glmerff.se<-sd(glmerff)
glmeraa.se.final<-sd(glmeraa.final)
glmerff.se.final<-sd(glmerff.final)
glmershf.se<-sd(glmershf)
glmersha.se<-sd(glmersha)
glmershnerr.se<-sd(glmershnerr)
glmersherr.se<-sd(glmersherr)
glmershnerr.se.final<-sd(glmershnerr.final)
glmersherr.se.final<-sd(glmersherr.final)
####
####Incentive: Nested Bootstrap##
chip.i<-c()
phip.i<-c()
phi.p.i<-c()
chip.i.final<-c()
phip.i.final<-c()
phi.p.i.final<-c()
glmeraa.i<-c()
glmerff.i<-c()
glmeraa.final.i<-c()
glmerff.final.i<-c()
glmershnerr.i<-c()
glmersherr.i<-c()
glmershnerr.final.i<-c()
glmersherr.final.i<-c()
glmshnerr.final.i<-c()
glmsherr.final.i<-c()
glmersha.i<-c()
glmershf.i<-c()
tfals.i<-c()
tfals.final.i<-c()
tsh.i<-c()
tsherr.i<-c()
tsherr.final.i<-c()
icc.shatt.i<-c()
icc.shfee.i<-c()
icc.shatt.final.i<-c()
glmershatt.final.i.chi<-c()
glmershatt.final.i.chip<-c()
glmershatt.i.chi<-c()
glmershatt.i.chip<-c()
glmershfee.i.chi<-c()
glmershfee.i.chip<-c()
qf<-was3d
##For each observation in the dataset make a zero##
qf$new.id<-c(rep(0,length(qf$subjectid)))
for(i in 1:n3){
a<-as.numeric(levels(as.factor(as.character(qf$subjectid[qf$incentive==1]))))
##Sort a sample from these subjects##
a<-sort(sample(a,size=length(a),replace=TRUE))
##Make some new IDs for the bootstrap sample##
new.id<-seq(length(a))
##Make a data frame holding the old and new ID numbers##
a<-cbind(a,new.id)
a<-data.frame(a)
q1<-data.frame()
####
##For each ID number##
for(j in 1:length(a[,2])){
##The participant with subjectid equal to the jth sampled ID number gets a new ID number equal to j##
  qf$new.id[qf$subjectid==a$a[j]]<-j
  ##Add this new participant to the bootstrapped sample##
q1<-rbind(q1,qf[qf$subjectid==a$a[j],])
} 

glmerfals.i<-glmer(error~falsification+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmeraa.i[i]<-invlogit(fixef(glmerfals.i)[1])
glmerff.i[i]<-invlogit(fixef(glmerfals.i)[1]+fixef(glmerfals.i)[2])
tfals.i[i]<-fixef(glmerfals.i)[2]/sqrt(diag(vcov(glmerfals.i))[2])

glmerfals.final.i<-glmer(false.final~falsification+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmeraa.final.i[i]<-invlogit(fixef(glmerfals.final.i)[1])
glmerff.final.i[i]<-invlogit(fixef(glmerfals.final.i)[1]+fixef(glmerfals.final.i)[2])
tfals.final.i[i]<-fixef(glmerfals.final.i)[2]/sqrt(diag(vcov(glmerfals.final.i))[2])

glmerphi<-glmer(actual.error~error+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmerphi.null<-glmer(actual.error~(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
nova<-anova(glmerphi.null,glmerphi)
chip.i[i]<-nova$Chisq[2]
phip.i[i]<-sqrt(nova$Chisq[2]/length(na.omit(q1[q1$incentive==1,]$error)))
phi.p.i[i]<-nova[2,7]

glmerphi<-glmer(actual.error~false.final+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmerphi.null<-glmer(actual.error~(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
nova<-anova(glmerphi.null,glmerphi)
chip.i.final[i]<-nova$Chisq[2]
phip.i.final[i]<-sqrt(nova$Chisq[2]/length(na.omit(q1[q1$incentive==1,]$false.final)))
phi.p.i.final[i]<-nova[2,7]

glmershfee.i<-glmer(share~falsification+(falsification|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmershfee.i.int<-glmer(share~falsification+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmersha.i[i]<-invlogit(fixef(glmershfee.i.int)[1])
glmershf.i[i]<-invlogit(fixef(glmershfee.i.int)[1]+fixef(glmershfee.i.int)[2])
glmershfee.i.anova<-anova(glmershfee.i,glmershfee.i.int)
glmershfee.i.chi[i]<-glmershfee.i.anova$Chisq[2]
glmershfee.i.chip[i]<-glmershfee.i.anova[2,7]
lmershfee.i<-lmer(share~falsification+(falsification|new.id),data=q1[q1$incentive==1,])
icc.shfee.i[i]<-(as.numeric(summary(lmershfee.i)@REmat[,4][2]))/(as.numeric(summary(lmershfee.i)@REmat[,4][1])+as.numeric(summary(lmershfee.i)@REmat[,4][2])+as.numeric(summary(lmershfee.i)@REmat[,4][3]))
tsh.i[i]<-abs(fixef(glmershfee.i)[2]/sqrt(diag(vcov(glmershfee.i))[2]))

glmershatt.i<-glmer(share~error+(error|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmershatt.i.int<-glmer(share~error+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmershnerr.i[i]<-invlogit(fixef(glmershatt.i.int)[1])
glmersherr.i[i]<-invlogit(fixef(glmershatt.i.int)[1]+fixef(glmershatt.i.int)[2])
glmershatt.i.anova<-anova(glmershatt.i,glmershatt.i.int)
glmershatt.i.chi[i]<-glmershatt.i.anova$Chisq[2]
glmershatt.i.chip[i]<-glmershatt.i.anova[2,7]
lmershatt.i<-lmer(share~error+(error|new.id),data=q1[q1$incentive==1,])
icc.shatt.i[i]<-(as.numeric(summary(lmershatt.i)@REmat[,4][2]))/(as.numeric(summary(lmershatt.i)@REmat[,4][1])+as.numeric(summary(lmershatt.i)@REmat[,4][2])+as.numeric(summary(lmershatt.i)@REmat[,4][3]))
tsherr.i[i]<-abs(fixef(glmershatt.i)[2]/sqrt(diag(vcov(glmershatt.i))[2]))

glmershatt.final.i<-glmer(share.final~false.final+(false.final|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmershatt.final.i.int<-glmer(share.final~false.final+(1|new.id),data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmershatt.final.i.anova<-anova(glmershatt.final.i,glmershatt.final.i.int)
glmershatt.final.i.chi[i]<-glmershatt.final.i.anova$Chisq[2]
glmershatt.final.i.chip[i]<-glmershatt.final.i.anova[2,7]
glmershnerr.final.i[i]<-invlogit(fixef(glmershatt.final.i.int)[1])
glmersherr.final.i[i]<-invlogit(fixef(glmershatt.final.i.int)[1]+fixef(glmershatt.final.i.int)[2])

tsherr.final.i[i]<-abs(fixef(glmershatt.final.i)[2]/sqrt(diag(vcov(glmershatt.final.i))[2]))
glmshatt.final.i<-glm(share.final~false.final,data=q1[q1$incentive==1,],family=binomial(link="logit"))
glmshnerr.final.i[i]<-invlogit(coef(glmshatt.final.i)[1])
glmsherr.final.i[i]<-invlogit(coef(glmshatt.final.i)[1]+coef(glmshatt.final.i)[2])

lmershatt.final.i<-lmer(share.final~false.final+(false.final|new.id),data=q1[q1$incentive==1,])
icc.shatt.final.i[i]<-(as.numeric(summary(lmershatt.final.i)@REmat[,4][2]))/(as.numeric(summary(lmershatt.final.i)@REmat[,4][1])+as.numeric(summary(lmershatt.final.i)@REmat[,4][2])+as.numeric(summary(lmershatt.final.i)@REmat[,4][3]))
}

glmeraa.se.i<-sd(glmeraa.i)
glmerff.se.i<-sd(glmerff.i)
glmeraa.se.final.i<-sd(glmeraa.final.i)
glmerff.se.final.i<-sd(glmerff.final.i)
glmershf.se.i<-sd(glmershf.i)
glmersha.se.i<-sd(glmersha.i)
glmershnerr.se.i<-sd(glmershnerr.i)
glmersherr.se.i<-sd(glmersherr.i)
glmershnerr.se.final.i<-sd(glmershnerr.final.i)
glmersherr.se.final.i<-sd(glmersherr.final.i)
glmsherr.se.final.i<-sd(glmsherr.final.i)
                       
####
##Control: Hierarchical linear model for correlation between prior (P(TFTR)) and error attribution##
glmer.ov<-glmer(error.should~error+(1|subjectid),data=was3d[was3d$incentive==0,],family=binomial(link="logit"))
glmer.ov.null<-glmer(error.should~(1|subjectid),data=was3d[was3d$incentive==0,],family=binomial(link="logit"))
nova.ov<-anova(glmer.ov.null,glmer.ov)
chi.ov<-nova.ov$Chisq[2]
phi.ov<-sqrt(chi.ov/length(na.omit(was3d$actual.error[was3d$incentive==0])))
####
##As a check: phi at the participant level, for participants for whom it could be calculated##
cons.chi<-c()
cons.denom<-c()
cons.phi<-c()
for(i in levels(as.factor(as.character(was3d$subjectid)))){
cons.chi[i]<-sum(na.omit(was3d$cons.tp[was3d$subjectid==i]))*sum(na.omit(was3d$cons.tn[was3d$subjectid==i]))-sum(na.omit(was3d$cons.fp[was3d$subjectid==i]))*sum(na.omit(was3d$cons.fn[was3d$subjectid==i]))
cons.denom[i]<-sqrt((sum(na.omit(was3d$cons.tp[was3d$subjectid==i]))+sum(na.omit(was3d$cons.fp[was3d$subjectid==i])))*(sum(na.omit(was3d$cons.fn[was3d$subjectid==i]))+sum(na.omit(was3d$cons.tn[was3d$subjectid==i])))*(sum(na.omit(was3d$cons.fp[was3d$subjectid==i]))+sum(na.omit(was3d$cons.tn[was3d$subjectid==i])))*(sum(na.omit(was3d$cons.tp[was3d$subjectid==i]))+sum(na.omit(was3d$cons.fn[was3d$subjectid==i]))))
cons.phi[i]<-cons.chi[i]/cons.denom[i]
}
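##Sketch (illustrative only, not used by the analysis): the participant-level phi computed in the
##loop above is the usual 2x2 phi coefficient,
##  phi = (tp*tn - fp*fn) / sqrt((tp+fp)*(fn+tn)*(fp+tn)*(tp+fn)).
##The helper name phi.2x2.sketch and its argument names are hypothetical, introduced here for illustration.
phi.2x2.sketch<-function(tp,tn,fp,fn){
  denom<-sqrt((tp+fp)*(fn+tn)*(fp+tn)*(tp+fn))
  ##phi is undefined when any margin of the 2x2 table is empty
  if(is.na(denom) || denom==0) return(NA_real_)
  (tp*tn-fp*fn)/denom
}
##Usage: phi.2x2.sketch(tp,tn,fp,fn) with the four cell counts for one participant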
####
##Incentive: Hierarchical linear model for correlation between prior (P(TFTR)) and error attribution##

glmer.ov.i<-glmer(error.should~error+(1|subjectid),data=was3d[was3d$incentive==1,],family=binomial(link="logit"))
glmer.ov.null.i<-glmer(error.should~(1|subjectid),data=was3d[was3d$incentive==1,],family=binomial(link="logit"))
nova.ov.i<-anova(glmer.ov.null.i,glmer.ov.i)
chi.ov.i<-nova.ov.i$Chisq[2]
phi.ov.i<-sqrt(chi.ov.i/length(na.omit(was3d$actual.error[was3d$incentive==1])))

####
##Total Bayes Trials##
bayes.control<-sum(was3d$error[was3d$falsification==1 & was3d$p.fit>0.8 & was3d$incentive==0])+sum(was3d$error[was3d$falsification==0 & was3d$p.fit<0.2 & was3d$incentive==0])+length(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==0])-sum(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==0])+length(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==0])-sum(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==0])

total.control<-length(was3d$error[was3d$falsification==1 & was3d$p.fit>0.8 & was3d$incentive==0])+length(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==0])+length(was3d$error[was3d$falsification==0 & was3d$p.fit<0.2 & was3d$incentive==0])+length(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==0])

bayes.incentive<-sum(was3d$error[was3d$falsification==1 & was3d$p.fit>0.8 & was3d$incentive==1])+sum(was3d$error[was3d$falsification==0 & was3d$p.fit<0.2 & was3d$incentive==1])+length(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==1])-sum(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==1])+length(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==1])-sum(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==1])

total.incentive<-length(was3d$error[was3d$falsification==1 & was3d$p.fit>0.8 & was3d$incentive==1])+length(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==1])+length(was3d$error[was3d$falsification==0 & was3d$p.fit<0.2 & was3d$incentive==1])+length(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==1])
###
##Trials##
was3d$complete<-ifelse(is.na(was3d$feedback)==FALSE,1,0)
trials<-c()
incentive<-c()
for(i in 1:length(levels(as.factor(was3d$subjectid)))){
trials[i]<-sum(was3d$complete[was3d$subjectid==i])
incentive[i]<-was3d$incentive[was3d$subjectid==i][1]
}
trialsd<-data.frame(cbind(trials,incentive))
trial.test<-t.test(trialsd$trials~trialsd$incentive)
@ 

\subsubsection{Participants}  
One hundred volunteers recruited through Amazon Mechanical Turk (MTurk) completed the task for \$5. Forty-six were women, and the average age was \Sexpr{prettyNum(mean(was3$age,na.rm=TRUE))} years (range: \Sexpr{min(was3$age[was3$age!=2],na.rm=TRUE)}--\Sexpr{max(was3$age,na.rm=TRUE)}).

\subsubsection{Design}
The experiment used a two-level (perverse vs.\ compatible incentive) between-subjects design.

\subsubsection{Materials}
The procedure and materials were the same as in Experiment Two except for the following modifications.  First, participants completed three `practice trials' to help them understand the task.  They were then told the following: 

\begin{quote}
 ``We are also interested in how people share information.  The information comes in trials.  A trial is a page where you proposed a triple and received feedback.  The practice trials you conducted are shown below.  For each trial you share, another person will get the triple you proposed and the feedback you received.  The person will also receive the Final Answer you propose at the end of the task, regardless of the trials you share.''
\end{quote}

\begin{flushleft}
Participants were then told about possible bonus money:
\end{flushleft}

\begin{quote}
``Both you and the person you share trials with can earn up to a \$5 bonus in addition to the \$5 you receive for participating in the experiment.'' 
\end{quote}

\begin{flushleft}
Participants in the perverse incentive condition then read the following text:
\end{flushleft}
\begin{quote}
``How you earn bonus money:
\begin{itemize}
  \item If the other person thinks your Final Answer matches the Actual Rule exactly, then you get \$5.
  \item If the other person thinks your Final Answer does not match the Actual Rule at all, then you get \$0.
  \item If the other person thinks your Final Answer somewhat matches the Actual Rule, then you get somewhere between \$0 and \$5.''
\end{itemize}
\end{quote}
\begin{quote}
``How the person you are sharing trials with earns bonus money: \newline
The person you are sharing trials with can also earn money.
\begin{itemize}
  \item This person gets the most money (\$5) by correctly judging how well your Final Answer matches the Actual Rule.
  \item If this person thinks your Final Answer matches the Actual Rule, but it does not, the other person gets less money. 
  \item If this person thinks your Final Answer does not match the Actual Rule, but it does, the other person gets less money.''
    \end{itemize}
\end{quote}

\begin{flushleft}
Those in the compatible incentive condition were told:
\end{flushleft}

\begin{quote}
  \begin{itemize}
  \item ``If the other person's guess matches the Actual Rule exactly, then you both get \$5.
  \item If the other person's guess does not match the Actual Rule at all, then you both get \$0.
  \item If the other person's guess somewhat matches the Actual Rule, then you both get somewhere between \$0 and \$5.''
  \end{itemize}
  \end{quote}

\begin{flushleft}
Finally, participants were told about the penalty for making incorrect attributions:
\end{flushleft}
\begin{quote}
``Penalty for wrong answers

Any bonus you get will be reduced if your false feedback and probability judgments are wrong. Thus, to earn the most money you should make your false feedback and probability judgments as accurate as possible.''
\end{quote}

\subsection{Results}

\subsubsection{Incentives and Performance}

As in Experiment Two, participants in the compatible and perverse incentive conditions completed a median of about 8 trials (\Sexpr{prettyNum(median(trialsd$trials[trialsd$incentive==1]))} and \Sexpr{prettyNum(median(trialsd$trials[trialsd$incentive==2]))}, respectively), $t(97)=$ \Sexpr{prettyNum(trial.test$statistic)}, $p=$ \Sexpr{prettyNum(ptrunc(trial.test$p.value))}, $d=$ \Sexpr{prettyNum(trial.test$statistic/sqrt(99))}. %$
Using the same scoring method as before, those in the compatible incentive condition scored about the same ($M=$ \Sexpr{prettyNum(mean(score.comp,na.rm=TRUE))}, $SD=$ \Sexpr{prettyNum(sd(score.comp,na.rm=TRUE))}) as those in the perverse incentive condition ($M=$ \Sexpr{prettyNum(mean(score.perv,na.rm=TRUE))}, $SD=$ \Sexpr{prettyNum(sd(score.perv,na.rm=TRUE))}), $t(98)=$ \Sexpr{prettyNum(t.test(score.comp,score.perv)$statistic)}, $p=$ \Sexpr{prettyNum(ptrunc(t.test(score.comp,score.perv)$p.value))}. %$

\subsubsection{Error Judgments}

Both during the task (\Sexpr{prettyNum(100*mean(glmerff))}\% vs. \Sexpr{prettyNum(100*mean(glmeraa))}\%) and at the end of the task (\Sexpr{prettyNum(100*mean(glmerff.final))}\% vs. \Sexpr{prettyNum(100*mean(glmeraa.final))}\%), those in the compatible incentive condition were more likely to judge feedback to be in error when it was disconfirming than when it was affirming ($t(510)=$ \Sexpr{prettyNum(mean(tfals))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tfals)),510)))}, $d=$ \Sexpr{prettyNum(mean(tfals)/sqrt(513))}; $t(525)=$ \Sexpr{prettyNum(mean(tfals.final))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tfals.final)),525)))}, $d=$ \Sexpr{prettyNum(mean(tfals.final)/sqrt(528))}, respectively).  Similarly, both during the task (\Sexpr{prettyNum(100*mean(glmerff.i))}\% vs. \Sexpr{prettyNum(100*mean(glmeraa.i))}\%) and at the end of the task (\Sexpr{prettyNum(100*mean(glmerff.final.i))}\% vs. \Sexpr{prettyNum(100*mean(glmeraa.final.i))}\%), those in the perverse incentive condition were significantly more likely to judge feedback to be in error when it was disconfirming than when it was affirming ($t(537)=$ \Sexpr{prettyNum(mean(tfals.i))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tfals.i)),537)))}, $d=$ \Sexpr{prettyNum(mean(tfals.i)/sqrt(540))}; $t(504)=$ \Sexpr{prettyNum(mean(tfals.final.i))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tfals.final.i)),504)))}, $d=$ \Sexpr{prettyNum(mean(tfals.final.i)/sqrt(507))}, respectively). 
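
The percentages above are fitted probabilities from a random-intercept logistic regression of error attributions on feedback type, averaged over bootstrap resamples of participants.  As a sketch of that model (the symbols $y_{ij}$, $x_{ij}$, $\beta_0$, $\beta_1$, $u_j$ and $\sigma_u^2$ are notation introduced here for illustration), let $y_{ij}=1$ when participant $j$ attributed the feedback on trial $i$ to error, and let $x_{ij}=1$ when that feedback was disconfirming and $0$ when it was affirming:
\begin{align*}
\operatorname{logit}\Pr(y_{ij}=1) &= \beta_0 + \beta_1 x_{ij} + u_j, \qquad u_j \sim \mathcal{N}(0,\sigma_u^2),\\
\hat{p}_{\text{affirming}} &= \operatorname{logit}^{-1}(\hat{\beta}_0),\\
\hat{p}_{\text{disconfirming}} &= \operatorname{logit}^{-1}(\hat{\beta}_0+\hat{\beta}_1).
\end{align*}
The reported $t$ statistic is $\hat{\beta}_1$ divided by its estimated standard error, likewise averaged over resamples.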

\subsubsection{Bayesian Consistency}

For both incentive groups, adding the penalty for incorrect error attributions and probability judgments greatly improved accuracy and consistency, as compared to Experiments One and Two.  For the compatible condition, the overall correlation between participants' error attributions and the consistency criterion was $\phi=$ \Sexpr{prettyNum(phi.ov)}, $\chi^{2}(1)=$ \Sexpr{prettyNum(chi.ov)}, $p<$ \Sexpr{prettyNum(ptrunc(pchisq(chi.ov,1,lower.tail=FALSE)))}.  Participants in the perverse incentive condition exhibited even greater consistency, $\phi=$ \Sexpr{prettyNum(phi.ov.i)}, $\chi^{2}(1)=$ \Sexpr{prettyNum(chi.ov.i)}, $p<$ \Sexpr{prettyNum(ptrunc(pchisq(chi.ov.i,1,lower.tail=FALSE)))}.\footnote{For affirming feedback in the compatible condition, they correctly attributed \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$falsification==0 & was3d$p.fit<0.2 & was3d$incentive==0])))} of \Sexpr{prettyNum(length(na.omit(was3d$error[was3d$falsification==0 & was3d$p.fit<0.2 & was3d$incentive==0])))} trials to error, and incorrectly attributed \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==0])))} of \Sexpr{prettyNum(length(na.omit(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==0])))} trials to error, $\chi^{2}(1)=$ \Sexpr{prettyNum(chi2a)}, $p<$ \Sexpr{prettyNum(ptrunc(pchisq(chi2a,1,lower.tail=FALSE)))}, $\phi=$ \Sexpr{prettyNum(phia)}.  
For disconfirming feedback they correctly attributed \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$falsification==1 & was3d$p.fit>0.8 & was3d$incentive==0])))} of \Sexpr{prettyNum(length(na.omit(was3d$error[was3d$falsification==1 & was3d$p.fit>0.8 & was3d$incentive==0])))} trials to error, and incorrectly attributed \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==0])))} of \Sexpr{prettyNum(length(na.omit(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==0])))} trials to error, $\chi^{2}(1)=$ \Sexpr{prettyNum(chi2f)}, $p<0.05$, $\phi=$ \Sexpr{prettyNum(phif)}.  For affirming feedback in the perverse condition, they correctly attributed \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$falsification==0 & was3d$p.fit<0.2 & was3d$incentive==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$error[was3d$falsification==0 & was3d$p.fit<0.2 & was3d$incentive==1])))} trials to error, and incorrectly attributed \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$error[was3d$falsification==0 & was3d$p.fit>0.2 & was3d$incentive==1])))} trials to error, $\chi^{2}(1)=$ \Sexpr{prettyNum(affchip.i)}, $p=$ \Sexpr{prettyNum(ptrunc(affphi.p.i))}, $\phi=$ \Sexpr{prettyNum(affphip.i)}.  
For disconfirming feedback they attributed \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$falsification==1 & was3d$p.fit>0.8 & was3d$incentive==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$error[was3d$falsification==1 & was3d$p.fit>0.8 & was3d$incentive==1])))} trials to error correctly, and incorrectly attributed \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$error[was3d$falsification==1 & was3d$p.fit<0.8 & was3d$incentive==1])))} trials to error, $\chi^{2}(1) = \Sexpr{prettyNum(discchip.i)}$, $p<0.05$, $\phi=$ \Sexpr{prettyNum(discphip.i)}.}  
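
As a reading aid for the $\phi$ coefficients, the following sketch assumes that they come from Pearson $\chi^{2}$ tests on $2\times 2$ tables without a continuity correction; this is an assumption, since the computation of these quantities is not shown in this section.  Under that assumption, with $N$ the number of trials in the table,
\begin{equation*}
\phi = \sqrt{\frac{\chi^{2}}{N}}.
\end{equation*}
For instance, with purely illustrative values (not results from this experiment) of $\chi^{2}(1)=25$ and $N=400$, $\phi=\sqrt{25/400}=0.25$.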

\subsubsection{Accuracy}
Participants both in the compatible and perverse incentive conditions were able to accurately identify error during the task ($\chi^{2}(1) = \Sexpr{prettyNum(mean(chip))}$, $p<$ \Sexpr{prettyNum(ptrunc(mean(phi.p)))}, $\phi = \Sexpr{prettyNum(mean(phip))}$; $\chi^{2}(1)=$ \Sexpr{prettyNum(mean(chip.i))}, $p<$ \Sexpr{prettyNum(ptrunc(mean(phi.p.i)))}, $\phi=$ \Sexpr{prettyNum(mean(phip.i))}, respectively).  Participants in the compatible incentive group correctly identified \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$actual.error==1 & was3d$incentive==0])))} of \Sexpr{prettyNum(sum(na.omit(was3d$actual.error[was3d$incentive==0])))} actual errors and incorrectly identified \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$actual.error==0 & was3d$incentive==0])))} of \Sexpr{prettyNum(length(na.omit(was3d$feedback[was3d$actual.error==0 & was3d$incentive==0])))} non-errors as error.  For the perverse incentive condition, participants correctly identified \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$actual.error==1 & was3d$incentive==1])))} of \Sexpr{prettyNum(sum(na.omit(was3d$actual.error[was3d$incentive==1])))} actual errors and incorrectly identified \Sexpr{prettyNum(sum(na.omit(was3d$error[was3d$actual.error==0 & was3d$incentive==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$feedback[was3d$actual.error==0 & was3d$incentive==1])))} non-errors as error.  This accuracy also slightly improved in judgments made at the end of the task for both the compatible and perverse incentive conditions ($\chi^{2}(1) = \Sexpr{prettyNum(mean(chip.final))}$, $p<$ \Sexpr{prettyNum(ptrunc(mean(phi.p.final)))}, $\phi = \Sexpr{prettyNum(mean(phip.final))}$; $\chi^{2}(1) = \Sexpr{prettyNum(mean(chip.i.final))}$, $p<$ \Sexpr{prettyNum(ptrunc(mean(phi.p.i.final)))}, $\phi = \Sexpr{prettyNum(mean(phip.i.final))}$, respectively).

\subsubsection{Data Sharing}
Participants in the compatible incentive condition shared \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$falsification==1 & was3d$incentive==0 & was3d$share==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$falsification==1 & was3d$incentive==0])))} trials when the feedback was disconfirming $(\Sexpr{prettyNum(100*mean(glmershf))}\%, SE = \Sexpr{prettyNum(100*glmershf.se)}\%)$ and \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$falsification==0 & was3d$incentive==0 & was3d$share==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$falsification==0 & was3d$incentive==0])))} when it was affirming $(\Sexpr{prettyNum(100*mean(glmersha))}\%, SE = \Sexpr{prettyNum(100*glmersha.se)}\%)$, $t(499)=$ \Sexpr{prettyNum(mean(tsh))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tsh)),499)))}, $d=$ \Sexpr{prettyNum(mean(tsh)/sqrt(502))}.  Similarly, they shared \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$error==1 & was3d$incentive==0 & was3d$share==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$error==1 & was3d$incentive==0])))} trials when they attributed feedback to error $(\Sexpr{prettyNum(100*mean(glmersherr))}\%, SE = \Sexpr{prettyNum(100*glmersherr.se)}\%)$ and \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$error==0 & was3d$incentive==0 & was3d$share==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$error==0 & was3d$incentive==0])))} when they judged it to be accurate $(\Sexpr{prettyNum(100*mean(glmershnerr))}\%, SE = \Sexpr{prettyNum(100*glmershnerr.se)}\%)$, $t(570)=$ \Sexpr{prettyNum(mean(tsherr))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tsherr)),570)))}, $d=$ \Sexpr{prettyNum(mean(tsherr)/sqrt(573))}.  

At the end of the task, participants shared \Sexpr{prettyNum(length(na.omit(was3d$share.final[was3d$false.final==1 & was3d$incentive==0 & was3d$share.final==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share.final[was3d$false.final==1 & was3d$incentive==0])))} trials that they judged to be errors $(\Sexpr{prettyNum(100*mean(glmersherr.final))}\%, SE = \Sexpr{prettyNum(100*glmersherr.se.final)}\%)$ and \Sexpr{prettyNum(length(na.omit(was3d$share.final[was3d$false.final==0 & was3d$incentive==0 & was3d$share.final==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share.final[was3d$false.final==0 & was3d$incentive==0])))} trials that they judged to be accurate $(\Sexpr{prettyNum(100*mean(glmershnerr.final))}\%, SE = \Sexpr{prettyNum(100*glmershnerr.se.final)}\%)$, $t(515)=$ \Sexpr{prettyNum(mean(tsherr.final))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tsherr.final)),515)))}, $d=$ \Sexpr{prettyNum(mean(tsherr.final)/sqrt(518))}.  However, there was also significant variation across participants in how much data they shared when they perceived the feedback to be an error, $\chi^{2}(1)=$ \Sexpr{prettyNum(mean(glmershatt.final.chi))}, $p<$ \Sexpr{prettyNum(ptrunc(median(glmershatt.final.chip)))}.  As can be seen in Figure 5.3, most participants in the compatible incentive condition shared all of the trials they attributed to error at the end of the task, while a sizable minority shared none of those trials.  However, there was no such variation for data sharing in response to disconfirming feedback, $\chi^{2}(2)=$ \Sexpr{prettyNum(mean(glmershfee.chi))}, $p=$ \Sexpr{prettyNum(ptrunc(median(glmershfee.chip)))}, or error attributions during the task, $\chi^{2}(2)=$ \Sexpr{prettyNum(mean(glmershatt.chi))}, $p=$ \Sexpr{prettyNum(ptrunc(median(glmershatt.chip)))}.  

<<histograms,echo=false,fig=false,results=hide>>=
glmershatt.final<-glmer(share.final~false.final+(false.final|subjectid),data=was3d[was3d$incentive==0,],family=binomial(link="logit"))
a<-sim(glmershatt.final,100)
hist.final<-c()
for(i in 1:length(as.numeric(colnames(slot(a,"ranef")$subjectid)))){
  hist.final[i]<-median(invlogit(slot(a,"fixef")[,1]+slot(a,"fixef")[,2]+slot(a,"ranef")$subjectid[,i,1]+slot(a,"ranef")$subjectid[,i,2]))}

glmershatt<-glmer(share~error+(error|subjectid),data=was3d[was3d$incentive==0,],family=binomial(link="logit"))
a<-sim(glmershatt,100)
hist.err<-c()
for(i in 1:length(as.numeric(colnames(slot(a,"ranef")$subjectid)))){
  hist.err[i]<-median(invlogit(slot(a,"fixef")[,1]+slot(a,"fixef")[,2]+slot(a,"ranef")$subjectid[,i,1]+slot(a,"ranef")$subjectid[,i,2]))}

glmershfee<-glmer(share~falsification+(falsification|subjectid),data=was3d[was3d$incentive==0,],family=binomial(link="logit"))
a<-sim(glmershfee,100)
hist.fee<-c()
for(i in 1:length(as.numeric(colnames(slot(a,"ranef")$subjectid)))){
  hist.fee[i]<-median(invlogit(slot(a,"fixef")[,1]+slot(a,"fixef")[,2]+slot(a,"ranef")$subjectid[,i,1]+slot(a,"ranef")$subjectid[,i,2]))}

glmershatt.final.i<-glmer(share.final~false.final+(false.final|subjectid),data=was3d[was3d$incentive==1,],family=binomial(link="logit"))
a<-sim(glmershatt.final.i,100)
hist.final.i<-c()
for(i in 1:length(as.numeric(colnames(slot(a,"ranef")$subjectid)))){
  hist.final.i[i]<-median(invlogit(slot(a,"fixef")[,1]+slot(a,"fixef")[,2]+slot(a,"ranef")$subjectid[,i,1]+slot(a,"ranef")$subjectid[,i,2]))}

glmershatt.i<-glmer(share~error+(error|subjectid),data=was3d[was3d$incentive==1,],family=binomial(link="logit"))
a<-sim(glmershatt.i,100)
hist.err.i<-c()
for(i in 1:length(as.numeric(colnames(slot(a,"ranef")$subjectid)))){
  hist.err.i[i]<-median(invlogit(slot(a,"fixef")[,1]+slot(a,"fixef")[,2]+slot(a,"ranef")$subjectid[,i,1]+slot(a,"ranef")$subjectid[,i,2]))}

glmershfee.i<-glmer(share~falsification+(falsification|subjectid),data=was3d[was3d$incentive==1,],family=binomial(link="logit"))
a<-sim(glmershfee.i,100)
hist.fee.i<-c()
for(i in 1:length(as.numeric(colnames(slot(a,"ranef")$subjectid)))){
  hist.fee.i[i]<-median(invlogit(slot(a,"fixef")[,1]+slot(a,"fixef")[,2]+slot(a,"ranef")$subjectid[,i,1]+slot(a,"ranef")$subjectid[,i,2]))}

png(file="was3.png",width=1500,height=1000,res=200)
hist.fee.df<-data.frame(hist=hist.fee,Incentive=rep("Compatible",length(hist.fee)),Type=rep("Disc. Feedback",length(hist.fee)))
hist.feei.df<-data.frame(hist=hist.fee.i,Incentive=rep("Perverse",length(hist.fee.i)),Type=rep("Disc. Feedback",length(hist.fee.i)))
hist.err.df<-data.frame(hist=hist.err,Incentive=rep("Compatible",length(hist.err)),Type=rep("Trial-by-Trial Error",length(hist.err)))
hist.erri.df<-data.frame(hist=hist.err.i,Incentive=rep("Perverse",length(hist.err.i)),Type=rep("Trial-by-Trial Error",length(hist.err.i)))
hist.final.df<-data.frame(hist=hist.final,Incentive=rep("Compatible",length(hist.final)),Type=rep("End-of-Task Error",length(hist.final)))
hist.finali.df<-data.frame(hist=hist.final.i,Incentive=rep("Perverse",length(hist.final.i)),Type=rep("End-of-Task Error",length(hist.final.i)))
hist<-rbind(hist.fee.df,hist.feei.df)
hist<-rbind(hist,hist.err.df)
hist<-rbind(hist,hist.erri.df)
hist<-rbind(hist,hist.final.df)
hist<-rbind(hist,hist.finali.df)

# Faceted histogram of per-participant sharing probabilities; counts are the default y for geom_histogram()
p <- ggplot(hist, aes(x=hist,color=Incentive,fill=Incentive))
# theme() replaces the defunct opts() interface; it must follow theme_bw() so the legend position is not reset
p <- p + geom_histogram()+facet_grid(Type~Incentive)+theme_bw()+ylab("Frequency")+xlab("Proportion of Trials Shared")+scale_y_continuous(limits = c(0,20))+scale_color_manual(values = c("darkblue","darkred"))+scale_fill_manual(values = c("white","white"))+theme(legend.position="bottom")
print(p)  # explicit print so the plot is drawn on the open png device
dev.off()

share.should<-glmer(share.final~actual.error+error+(1|subjectid),data=was3d[was3d$incentive==0,],family=binomial(link="logit"))
share.should.i<-glmer(share.final~actual.error+error+(1|subjectid),data=was3d[was3d$incentive==1,],family=binomial(link="logit"))
@ 

\begin{figure}[h]
    \centering
\scalebox{1.4}{\includegraphics{was3}}
\caption[Experiment Three Data Sharing]{Proportion of trials shared by whether the trial was disconfirming (top row), whether participants attributed that trial to error during the task (middle row), and whether participants attributed the trial to error at the end of the task (bottom row).}
\end{figure}
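
Each value in these histograms is the quantity computed in the preceding code chunk.  As a sketch of that computation (the symbols are introduced here for illustration), let $\beta_{0}^{(s)}$ and $\beta_{1}^{(s)}$ be the fixed intercept and slope in simulation draw $s$, and let $u_{0i}^{(s)}$ and $u_{1i}^{(s)}$ be the random intercept and slope for participant $i$.  The plotted value for participant $i$ is the median, over the 100 draws, of the predicted probability of sharing a trial when the predictor equals one (disconfirming feedback, or an error attribution):
\begin{equation*}
\hat{p}_{i} = \operatorname*{median}_{s}\; \operatorname{logit}^{-1}\!\left(\beta_{0}^{(s)} + \beta_{1}^{(s)} + u_{0i}^{(s)} + u_{1i}^{(s)}\right), \qquad \operatorname{logit}^{-1}(x) = \frac{1}{1+e^{-x}}.
\end{equation*}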

Unexpectedly, participants in the perverse incentive condition did not share trials at lower rates than those in the compatible incentive condition.  They shared \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$falsification==1 & was3d$incentive==1 & was3d$share==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$falsification==1 & was3d$incentive==1])))} trials when the feedback was disconfirming $(\Sexpr{prettyNum(100*mean(glmershf.i))}\%, SE = \Sexpr{prettyNum(100*glmershf.se.i)}\%)$ and \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$falsification==0 & was3d$incentive==1 & was3d$share==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$falsification==0 & was3d$incentive==1])))} trials when it was affirming $(\Sexpr{prettyNum(100*mean(glmersha.i))}\%, SE = \Sexpr{prettyNum(100*glmersha.se.i)}\%)$, $t(520)=$ \Sexpr{prettyNum(mean(tsh.i))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tsh.i)),520)))}, $d=$ \Sexpr{prettyNum(mean(tsh.i)/sqrt(523))}.  They also shared \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$error==1 & was3d$incentive==1 & was3d$share==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$error==1 & was3d$incentive==1])))} trials when they judged the feedback to be an error during the task $(\Sexpr{prettyNum(100*mean(glmersherr.i))}\%, SE = \Sexpr{prettyNum(100*glmersherr.se.i)}\%)$ and \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$error==0 & was3d$incentive==1 & was3d$share==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share[was3d$error==0 & was3d$incentive==1])))} trials when they judged the feedback to be accurate $(\Sexpr{prettyNum(100*mean(glmershnerr.i))}\%, SE = \Sexpr{prettyNum(100*glmershnerr.se.i)}\%)$, $t(520)=$ \Sexpr{prettyNum(mean(tsherr.i))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tsherr.i)),520)))}, $d=$ \Sexpr{prettyNum(mean(tsherr.i)/sqrt(523))}.  
At the end of the task, they shared \Sexpr{prettyNum(length(na.omit(was3d$share.final[was3d$share.final==1 & was3d$incentive==1 & was3d$false.final==1])))} of \Sexpr{prettyNum(length(na.omit(was3d$share.final[was3d$incentive==1 & was3d$false.final==1])))} trials that they judged to be an error $(\Sexpr{prettyNum(100*mean(glmersherr.final.i))}\%, SE = \Sexpr{prettyNum(100*glmersherr.se.final.i)}\%)$ and \Sexpr{prettyNum(length(na.omit(was3d$share.final[was3d$share.final==1 & was3d$incentive==1 & was3d$false.final==0])))} of \Sexpr{prettyNum(length(na.omit(was3d$share.final[was3d$incentive==1 & was3d$false.final==0])))} trials that they judged to be accurate $(\Sexpr{prettyNum(100*mean(glmershnerr.final.i))}\%, SE = \Sexpr{prettyNum(100*glmershnerr.se.final.i)}\%)$, $t(481)=$ \Sexpr{prettyNum(mean(tsherr.final.i))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(mean(tsherr.final.i)),481)))}, $d=$ \Sexpr{prettyNum(mean(tsherr.final.i)/sqrt(484))}.

<<weak testing and selective reporting,echo=false,results=hide,fig=false>>=
scammer.id<-c(1,3,15,16,19,36,57,68,85,96)
nonscammer.id<-c(2,4,6,7,8,9,10,11,13,14,17,18,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,58,60,61,62,63,64,65,66,67,69,70,71,72,73,74,75,76,77,78,79,80,81,82,84,86,87,88,89,90,91,92,93,94,95,97,98,99,100)
sneaky.stoppers<-c(31,53,60,65,86,87,90,92,94)
why.share.all<-c(42,54,55,56,63,73,78,88,98,100)
was3d$selective<-ifelse(was3d$subjectid %in% scammer.id,1,0)

shared.all<-data.frame(shared.all=c(rep(NA,100)))
incent<-data.frame(incent=c(rep(NA,100)))
subjectid<-data.frame(subjectid=c(rep(NA,100)))
for(i in 1:100){
shared.all[i,1]<-ifelse(length(na.omit(was3d$share[was3d$subjectid==i & was3d$falsification==1]))==0,NA,prod(na.omit(was3d$share[was3d$subjectid==i & was3d$falsification==1])))
incent[i,1]<-as.character(was3d$incentive[was3d$subjectid==i][1])
subjectid[i,1]<-i
}
shared.all.dat<-data.frame(shared.all,subjectid,incent,stringsAsFactors=FALSE)
shared.all.dat$shared.all<-as.factor(shared.all.dat$shared.all)
shared.all.dat$incent<-as.factor(shared.all.dat$incent)

                                 
sum(was3d$selective[was3d$incentive==0])/43
sum(was3d$selective[was3d$incentive==1])/43
was3d$hplus<-ifelse(was3d$num1==2 & was3d$num2==4 & was3d$num3==6,1,0)
hplusglmer<-glmer(hplus~incentive+(1|subjectid),data=was3d,family=binomial(link="logit"))

act.att<-sum(na.omit(was3d$share.final[was3d$actual.error==1 & was3d$false.final==1]))
act.att.total<-length(na.omit(was3d$share.final[was3d$actual.error==1 & was3d$false.final==1]))
act<-sum(na.omit(was3d$share.final[was3d$actual.error==1 & was3d$false.final==0]))
act.total<-length(na.omit(was3d$share.final[was3d$actual.error==1 & was3d$false.final==0]))
att<-sum(na.omit(was3d$share.final[was3d$actual.error==0 & was3d$false.final==1]))
att.total<-length(na.omit(was3d$share.final[was3d$actual.error==0 & was3d$false.final==1]))
acc<-sum(na.omit(was3d$share.final[was3d$actual.error==0 & was3d$false.final==0]))
acc.total<-length(na.omit(was3d$share.final[was3d$actual.error==0 & was3d$false.final==0]))

att.data<-na.omit(was3d[was3d$actual.error==0 & was3d$error==1,c("share.final","num1","num2","num3","falsification")])
att.data.nshare<-na.omit(was3d[was3d$actual.error==0 & was3d$false.final==1 & was3d$share.final==0,c("share.final","num1","num2","num3","falsification")])
att.data.share<-na.omit(was3d[was3d$actual.error==0 & was3d$false.final==1 & was3d$share.final==1,c("share.final","num1","num2","num3","falsification")])
att.data.share.aff<-na.omit(was3d[was3d$actual.error==0 & was3d$false.final==1 & was3d$share.final==1 & was3d$falsification==0,c("share.final","num1","num2","num3","falsification")])

act.data.nshare<-na.omit(was3d[was3d$actual.error==1 & was3d$false.final==0 & was3d$share.final==0,c("share.final","num1","num2","num3","falsification")])
act.data.share<-na.omit(was3d[was3d$actual.error==1 & was3d$false.final==0 & was3d$share.final==1,c("share.final","num1","num2","num3","falsification")])
act.data.share.aff<-na.omit(was3d[was3d$actual.error==1 & was3d$false.final==0 & was3d$share.final==1 & was3d$falsification==0,c("share.final","num1","num2","num3","falsification")])

false.final.glmer<-glmer(false.final~actual.error*falsification+(1|subjectid),data=was3d,family=binomial(link="logit"))
@ 

As seen in Figure 5.3, there was significant variation across participants in whether they shared data after receiving disconfirming feedback, $\chi^{2}(1)=$ \Sexpr{prettyNum(mean(glmershfee.i.chi))}, $p=$ \Sexpr{prettyNum(ptrunc(median(glmershfee.i.chip)))}, whether they shared data that they perceived to be error during the task, $\chi^{2}(1)=$ \Sexpr{prettyNum(mean(glmershatt.i.chi))}, $p=$ \Sexpr{prettyNum(ptrunc(median(glmershatt.i.chip)))}, and whether they shared data that they perceived to be error at the end of the task, $\chi^{2}(1)=$ \Sexpr{prettyNum(mean(glmershatt.final.i.chi))}, $p<$ \Sexpr{prettyNum(ptrunc(median(glmershatt.final.i.chip)))}.  For all three judgments, most participants in the perverse incentive condition shared all of their trials, with a minority sharing fewer.  

Our prediction was that some participants would be seduced by the perverse incentive, thus deciding only to share trials that were consistent with their final answer.  However, there was no difference between conditions in the probability of omitting data that were inconsistent with their final answer, $t(999)=$ \Sexpr{prettyNum(fixef(final.consistent.t)[4]/sqrt(diag(vcov(final.consistent.t))[4]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(final.consistent.t)[4]/sqrt(diag(vcov(final.consistent.t))[4])),999)))}.  A second way that participants could produce these results while exploiting the perverse incentive would be to seek out only affirming data, knowing that these data would make a simple and convincing story.  One way to implement this weak testing strategy is to propose the (2,4,6) triple, knowing that they would receive affirming feedback unless the feedback is in error.  However, participants in the two incentive conditions were equally likely to propose (2,4,6) triples, $t(1154)=$ \Sexpr{prettyNum(abs(fixef(hplusglmer)[2]/sqrt(diag(vcov(hplusglmer))[2])))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(hplusglmer)[2]/sqrt(diag(vcov(hplusglmer))[2])),1154)))}.

As participants were both accurate and consistent in their error attributions, they may have been able to remove actual errors from the data they shared.  Overall, at the end of the task participants shared \Sexpr{prettyNum(act.att)} of \Sexpr{prettyNum(act.att.total)} (\Sexpr{prettyNum(100*act.att/act.att.total)}\%) trials that were both actual errors and perceived as errors, \Sexpr{prettyNum(act)} of \Sexpr{prettyNum(act.total)} (\Sexpr{prettyNum(100*act/act.total)}\%) trials that were actual errors but not perceived as errors, \Sexpr{prettyNum(att)} of \Sexpr{prettyNum(att.total)} (\Sexpr{prettyNum(100*att/att.total)}\%) trials that were perceived as errors but not actual errors, and \Sexpr{prettyNum(acc)} of \Sexpr{prettyNum(acc.total)} (\Sexpr{prettyNum(100*acc/acc.total)}\%) trials that were neither perceived as error nor actual error.  When including both main effects and the interaction between actual error and attribution of error to predict whether each trial would be shared at the end of the task, there was only a significant main effect of error attribution, and not actual error, for both compatible and perverse conditions ($t(509)=$ \Sexpr{prettyNum(abs(fixef(share.should)[3]/sqrt(diag(vcov(share.should)))[3]))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(share.should)[3]/sqrt(diag(vcov(share.should)))[3]),509)))} vs. $t(479)=$ \Sexpr{prettyNum(abs(fixef(share.should.i)[3]/sqrt(diag(vcov(share.should.i)))[3]))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(share.should.i)[3]/sqrt(diag(vcov(share.should.i)))[3]),479)))}, respectively).  This means that error attributions, but not actual errors, determine whether data are shared.  

The reason perceived and actual errors diverged was that disconfirmation had a systematic and additive effect on perceived error, even after controlling for actual error.  Main effects of both actual error ($t(1025)=$ \Sexpr{prettyNum(fixef(false.final.glmer)[2]/sqrt(diag(vcov(false.final.glmer))[2]))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(false.final.glmer)[2]/sqrt(diag(vcov(false.final.glmer))[2])),1025)))}) and disconfirming feedback ($t(1025)=$ \Sexpr{prettyNum(fixef(false.final.glmer)[3]/sqrt(diag(vcov(false.final.glmer))[3]))}, $p<$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(false.final.glmer)[3]/sqrt(diag(vcov(false.final.glmer))[3])),1025)))}) increased the chance of attributing a trial to error at the end of the task, with no significant interaction between the two ($t(1025)=$ \Sexpr{prettyNum(fixef(false.final.glmer)[4]/sqrt(diag(vcov(false.final.glmer))[4]))}, $p=$ \Sexpr{prettyNum(ptrunc(2*pt(-abs(fixef(false.final.glmer)[4]/sqrt(diag(vcov(false.final.glmer))[4])),1025)))}).  Thus, affirming trials were shared more often, as they were less likely to be perceived as errors than disconfirming trials even when they were actually errors, whereas disconfirming trials were shared less frequently because they were inappropriately seen as errors when they were not.

\subsection{Discussion}

Participants with a compatible or perverse incentive to share data were equally likely to attribute disconfirming feedback to error.  The financial penalty for making incorrect probability judgments and error attributions produced greater consistency and accuracy, compared to Experiments One and Two.  Participants in both incentive conditions also shared fewer trials whose feedback was disconfirming or attributed to error, either during or at the end of the task.  Although participants were successful in identifying actual errors, it was attributions of error that determined whether they shared trials, indicating that being able to identify error does not preclude failing to share trials with accurate disconfirmations while sharing ones with inaccurate affirmations.

We expected the perverse incentive to reduce the consistency and accuracy of error attributions, as well as to reduce the sharing of data attributed to error.  However, such motivated reasoning was not observed.  Rather, data sharing behavior in the two conditions differed in an unexpected way.  Both for decisions made after each trial and at the end of the task, participants in the perverse incentive condition shared \emph{more} data than those in the compatible incentive condition, thereby demonstrating a more ethical data sharing stance.  While it is possible that higher stakes, such as those involved in pharmaceutical or academic research, would lead to motivated reasoning and selective data sharing, participants responded to the moderate stakes used in this research with reasoned and ethical behavior.  

In decisions made at the end of all trials, however, some participants in the perverse incentive condition decided to share none of the data they attributed to error.  Contrary to our prediction, participants in the perverse incentive condition did not omit more trials that were inconsistent with their final answer than did those in the compatible incentive condition.  Additionally, those in the perverse incentive condition did not try to produce a convincing story by taking as few trials as possible in order to reduce the risk of collecting inconvenient data, which would either make their Final Answer less convincing or require selective reporting.

There are several possible explanations why participants in the perverse incentive condition shared trials at a higher rate than those in the compatible incentive condition.  First, they may have thought that the learner knows they can hide data, even though the instructions indicated that only the trials they decided to share would be shared.  Second, they may have believed that sharing more trials increases the learner's confidence, regardless of whether they are consistent with their Final Answer.  Third, they may have been more strongly motivated to do the right thing and give the learner all the data available, even if that came at the cost of their own compensation.  Four examples of such motivation:

\begin{quote}
\begin{enumerate}
\item ``\Sexpr{was3.t$V1096[68]}''
\item ``\Sexpr{was3.t$V1063[28]}''
\item ``\Sexpr{was3.t$V1063[29]}''
\item ``I shared everything because, not knowing if the FIT/DNF response by the computer was correct, I didn't want to deliberately bias the info I passed on by being selective.''
\end{enumerate}
\end{quote}

<<perverse.open,echo=false,results=hide,fig=false>>=
perverse.open1<-was3.t$V318
perverse.open2<-was3.t$V320
compatible.open1<-was3.t$V322
compatible.open2<-was3.t$V323
@ 

Thus, Experiment Three extends the positive test strategy to communication of results, seen in selective reporting, such that disconfirming data are seen as both caused by error and not worthy of sharing with others.  Contrary to our prediction of motivated reasoning \cite{kunda1990case}, the perverse incentive not only failed to increase error attributions but also increased the sharing of data that were disconfirming or attributed to error.

\section{General Discussion}

Over 50 years of psychological research has found that hypothesis testing follows a positive test strategy \cite{klayman1987confirmation}, whereby people collect data that they expect to affirm their expectations and discount disconfirming data, should it nonetheless reach them.  The present study asks how the positive test strategy affects data sharing.  We use the Wason 2-4-6 rule discovery task \cite{wason1960failure}, adding the possibility of error to simulate the uncertainty of actual research \cite{penner1996trust}.  In this task, participants seek to discover a rule by conducting `experiments' to test their hypotheses about its answer, then receive affirming or disconfirming feedback, known to have a 20\% error rate.  We extended the task by adding several incentive schemes, then examining their effects on participants' decisions about sharing the feedback they received with another person.  We also evaluated participants' performance in terms of the accuracy and consistency of their judgments of whether the feedback is error.

Experiment One replicated the pattern of results from previous studies, finding that disconfirming feedback is attributed to error more often than is affirming feedback \cite{penner1996trust}.  A new result is that participants' error attributions were generally consistent with their prior beliefs, in the sense of their being more likely to attribute affirmative feedback to error when they had strongly expected that the triple would not fit the rule, and being more likely to attribute disconfirming feedback to error when they had strongly expected the triple to fit the rule.  However, their judgments of whether the feedback was in error were unrelated to its accuracy.  Whether they shared trial results was unrelated to whether the feedback was disconfirming or attributed to error.

Experiment Two replicated Experiment One along with a new condition that provided participants with a large financial incentive for discovering the rule.  As in Experiment One, participants attributed disconfirming feedback to error at a greater rate than affirming feedback in the control condition, but not the incentive condition.  Those in the control condition again made error attributions that were somewhat consistent with their expectations but were quite inaccurate.  In contrast, participants in the incentive condition were neither consistent nor accurate.  Experiment Two elicited data sharing decisions after each trial, using a fixed-response format, unlike Experiment One which asked a single open-ended question at the end.  Participants in both conditions were more likely to share feedback if it was affirming and perceived to be accurate.

Experiment Three introduced two incentive schemes for sharing data: (a) \emph{compatible} incentives rewarded the sharer and receiver based on the receiver's success; (b) \emph{perverse} incentives rewarded the sharer based on whether the receiver believed that the problem had been solved, and did not disclose when data were not shared.  Both conditions penalized participants for making inaccurate probability and error judgments.  As before, participants in both conditions were more likely to attribute feedback to error when it was disconfirming.  The penalty increased both the accuracy and consistency of error attributions for participants in both conditions, compared to Experiments One and Two.  Contrary to prediction, participants with the perverse incentive shared more trials that were disconfirming or attributed to error than did participants with the compatible incentive.  In both conditions, despite these participants' ability to identify error feedback, their perception of error was more important than actual error in determining their data sharing.

The present research has several internal and external validity limitations.  In terms of internal validity, data sharing judgments were worded as ``information sharing,'' possibly sending the message to participants in the perverse incentive condition that they should share rather than hide data.  Many participants also discontinued their participation prematurely, explaining that the rule was too simple, not realizing that they had not identified it.  For example:

\begin{quote}
``\Sexpr{was3.t$V1086[81]}''%$
\end{quote}

Those who quit prematurely, proposing only a few trials, also proposed only trials that they expected to receive affirmation, and received only affirmation, except for rare errors that they were highly accurate in identifying.  This confound limited participants' chance of obtaining disconfirming feedback.  As disconfirming feedback is necessary for selective reporting, this confound causes the experiments to underestimate its magnitude.

The circumstances of the experiments differ from those of working scientists in several ways.  First, scientists never know the exact error rates in their experiments, but have, instead, just a range of plausible values based on their experience and intuition.  Those ambiguous error rates may be more readily modified to fit results than the fixed ones used in the experiments.  Second, although the patterns observed here generally parallel those observed in real labs \cite{dunbar1995scientists}, the participants were either undergraduates or MTurk respondents, not scientists.  The training and experience of working scientists may allow them to identify and report only accurate data, appropriately omitting errors that would confuse readers.

An additional experiment is needed to clarify the data-sharing results from Experiment Three.  That experiment found that participants given a perverse incentive behaved more ethically than participants in the compatible incentive condition, in the sense of sharing more trials that were disconfirming or attributed to error.  One possible cause of this surprising result is that participants may have thought that the other participant knew they could hide trials, and hence might become suspicious if data were too orderly.  A second is that the sharers were genuinely willing to sacrifice their own pay to benefit others, with incentives that evoked ethical concerns.  To determine which explanation is correct, Experiment Four will explicitly manipulate whether participants are told that the person receiving the data knows that the sharer does not have to include all the data, while also using more neutral language so the task is not perceived as being about cooperation or sharing.  Additionally, all participants will be told the correct answer at the end of the task.  They will then be allowed to modify the data they share, but not adjust their Final Answer.  Thus, concern for ethics and altruism should lead participants to change the data they share to match the correct answer, even at a likely cost to their own payoff.  However, if other concerns determine their data sharing, such as uncertainty about whether trials were errors or fear of being caught, then knowing the correct answer should allow participants to share only trials that are consistent with their Final Answer, especially for those who believe they cannot be caught.

Experiment Five will integrate the most effective methods from the previous experiments.  For data sharing to matter, participants must not be able to solve the rule easily: if they do solve it, there is no potential conflict between their Final Answer and the Actual Rule, and no opportunity for selective reporting.  To that end, Experiment Five will use an alternative rule ($x,x^{2},x^{2}+2$) that should give more disconfirming feedback and be difficult to solve, thus encouraging participants to see the task as a challenge and complete more trials.  Experiment Five will also use a more contextualized task, so that the terminology is easier to comprehend (e.g., true or false feedback).

The results of three experiments suggest that financial penalties are needed to help participants accurately evaluate their data.  Without such penalties, Experiments One and Two elicited error attributions that were largely inaccurate and inconsistent with prior beliefs.  In Experiment Three, adding a financial penalty for incorrect judgments substantially increased consistency and accuracy.  However, participants still shared data that were systematically biased by feedback, including inaccurate affirmations and excluding accurate disconfirmations.  This selective reporting occurred even when poor data sharing could cost the sharer money, as in the compatible incentive condition.

The difficulty participants had when trying to avoid sharing errors shows that helpful selective reporting is not easy.  In none of the three experiments did participants have the perfect accuracy in error attributions that would be required to selectively exclude errors from shared data.  One strategy they could have used to achieve accurate selective reporting is exact replication.  At the end of the task, exact replications would allow participants to clearly identify which trials were errors and which were accurate and, in turn, to selectively report only accurate data.

Similar policies can help real scientists share data.  Experiment Three found that penalties for incorrect probability judgments and error attributions greatly increased consistency and accuracy.  One way to implement such a penalty would be to require that statistical analyses and experimental methods presented in published reports provide enough detail, in the paper or ancillary material, to be reproducible, with appropriate professional penalties for those who fail.  As a protection, researchers can adopt the protocols of impartial organizations dedicated to independent replication of experiments and analyses (e.g., \url{https://www.scienceexchange.com/}).  Another way of improving error identification is to encourage researchers to complete exact replications.  These replications allow researchers to identify errors with high accuracy and make selective reporting of perceived errors highly accurate.

\part{Prescriptive}
\section*{Introduction to the Prescriptive Analysis}

The final part of the dissertation proposes methods of bringing human behavior, as determined by the descriptive analyses of Part Three, in line with normative standards, as proposed in Part Two.  Chapter Two concluded that, although there is no logical ground for determining whether data or theory is faulty when they conflict, data sharing policies that omit disconfirming data are unethical because they impose conventions on the reader, thus deceiving them.  However, Chapter Four found that surprising disconfirmations are perceived to be caused by error, and future observations that were seen as diffuse were judged to be less worthy of publication.  Chapter Three concluded that disconfirmations are more likely to be errors than affirmations only when the selection of true hypotheses is common.  However, participants in the Wason rule discovery task thought the opposite.  With no penalty for incorrect error attributions, participants proposed triples that did not fit the rule (false hypotheses) more often than those that did fit the rule, but attributed error more often to disconfirmation than affirmation.  Finally, in the rule discovery task, probability judgments improved with a financial penalty for incorrect answers.  As judgments involving probability and statistics are always communicated in research, Chapter Six proposes methods of documenting data, methods, and statistical analyses so that penalties can be implemented when inferences are faulty. 
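
The Chapter Three conclusion summarized above can be illustrated with a minimal sketch.  The notation is introduced here for illustration only and assumes, for simplicity, a single symmetric error rate; it is not necessarily the model used in Chapter Three.  Let $p$ be the probability that a proposed hypothesis is true, and let $e$, with $0 < e < 1$, be the probability that feedback is in error.  By Bayes' rule,
\begin{align*}
P(\text{error} \mid \text{disconfirmation}) &= \frac{p\,e}{p\,e + (1-p)(1-e)}, \\
P(\text{error} \mid \text{affirmation}) &= \frac{(1-p)\,e}{(1-p)\,e + p\,(1-e)}.
\end{align*}
Cross-multiplying shows that the first probability exceeds the second exactly when $p^{2} > (1-p)^{2}$, that is, when $p > 1/2$.  With the hypothetical values $e = 0.2$ and $p = 0.3$, for example, the two probabilities are $0.06/0.62 \approx 0.10$ and $0.14/0.38 \approx 0.37$, so under these assumptions an affirmation is the more likely of the two to be an error.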

\chapter{Open Communication}
Up to this point, the dissertation has dealt mainly with the philosophical, mathematical, and psychological challenges to data sharing.  In this chapter, I outline a simple procedure for implementing data sharing practically.  It uses three technological solutions:
\begin{itemize}
  \item Open-Data
  \item Open-Methods
  \item Open-Analyses
\end{itemize}

The hope is that, as these elements are laid out and standardized, journals are likely to change their policies to meet the standards, as indicated by one editor of the journal Nature (\href{http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CCQQFjAA&url=http%3A%2F%2Fwww.stanford.edu%2F~vcs%2FNov21%2Fhilary_spencer_rdcscsJan2010.pdf&ei=OgZBUNXUDIjv0gGioIGIDA&usg=AFQjCNG3iFutwkWUiPJH0pxOp-Fjb9Ogrg&sig2=vQvLUEErmUs75TgWrf7Msw}{Spencer, 2010}).  Additional information on communication of research and uncertainty can be found in \cite{fischhoff2012communicating}. 

\section{Open and Archived Data}

To meet the criterion set out in Chapter Two of imposing minimal irrevocable conventions on the reader, a generic open-data convention is needed.  Luckily, this has been done for us (\href{http://opendefinition.org/}{OpenDefinition}):
\begin{quote}
  ``A piece of content or data is open if anyone is free to use, reuse, and redistribute it---subject only, at most, to the requirement to attribute and/or share-alike.''
\end{quote}

Along with this open-data definition, social scientists need to compile a list of conventions they consider important, and release a document like the CONSORT statement \cite{schulz2010consort}.  Once these minimal conventions are agreed on, data can be documented and archived accordingly on a variety of websites, such as \href{http://thedata.org}{DataVerse} and \href{http://psychfiledrawer.org}{PsychFileDrawer}.

All of the data and materials from Chapter Four are available here:
\begin{quote}
  \centering
  Davis, Alexander \\
  ``Surprises, Error, and Data Sharing'' \\
  \url{http://hdl.handle.net/1902.1/14819} \\ 
  V3 [Version]
\end{quote}

All of the data and materials from Chapter Five are available here:

\begin{quote}
  \centering 
  Davis, Alexander\\
  ``Incentives, Error, and Data Sharing''\\
  \url{http://hdl.handle.net/1902.1/18699} \\  
  V1 [Version]
\end{quote}

\section{Open and Archived Methods}
The data are only half of the documentation process.  The methods used to generate the data need to be documented as carefully, or more so.  This can be seen as a problem of version-controlling one's experiments (\href{http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&ved=0CCIQFjAA&url=http%3A%2F%2Fwww.stanford.edu%2F~vcs%2FAAAS2011%2F1102_aaas_reproducibility_fperez.pdf&ei=eghBULyNHsHB0QHgt4DgDA&usg=AFQjCNFT_Om98Vi77WiJWoa4Bd9M9Dd_AQ&sig2=UuFf0uEETcOhHvy1VfZL-A}{Perez, 2011}), which can be dealt with using version control software like \href{http://git-scm.com/}{Git}.  Version control can be used for any part of the research process, from the development of methods, computational tools, and materials to the writing of papers and grants.  References can be made to documentation of the run-up to discovery, or ``warm-up period,'' using open lab notebooks such as \href{http://openwetware.org/wiki/Main_Page}{OpenWetWare}.  Some pretesting from Chapters Four and Five is documented here: \href{http://openwetware.org/wiki/User:Alexander_L._Davis/Notebook/Error_Models_and_Data_Sharing_in_Hindsight}{Alex's OpenWetWare}.

\section{Open and Reproducible Data Analysis}
To avoid the need for forensic statistics, aimed at recreating the black box of what the authors must have done to their data \cite{baggerly2009deriving}, and to make statistical analyses truly reproducible, I use \href{http://www.statistik.lmu.de/~leisch/Sweave/}{Sweave}.  Sweave integrates the free \href{http://www.r-project.org/}{R} statistical computing language with the free document preparation system \href{http://www.latex-project.org/}{\LaTeX}.  All statistical analyses are coded in R directly in the Sweave document, which is otherwise an ordinary \LaTeX{} document.  Thus, given the Sweave document and the original data files, any reader can inspect and reproduce every statistical analysis along with the entire published paper.  In fact, this entire dissertation was written this way, and the Sweave document can be obtained here (link).  I encourage readers to reproduce and check my code for errors, with rewards for those who succeed in finding errors.

\part{Conclusion}
\chapter{Recapitulation}
This dissertation has evaluated the file-drawer problem, in which disconfirming data are selectively excluded from published reports, from historical, normative, empirical, and prescriptive perspectives.  The historical perspective suggests that incentives to publish only confirming data, as well as perceptions that disconfirming data are faulty, are the most likely causes of the file-drawer problem (Chapter One).  These incentives threaten the scientific community and altruistic researchers, as the community can only identify valid data if penalties for selective reporting are high enough.  Additionally, approaches that allow researchers to determine when data are faulty are based on possibly useful conventions that are not universally justified.  Because researchers have some flexibility in determining the reporting conventions that are most appropriate for their circumstances, any ethical data sharing policy must not impose these conventions on other readers who must use their data, as this imposition is deceptive and, in turn, unethical (Chapter Two).  Chapter Three analyzes two conventions that may be invoked to justify discarding disconfirming data: 1) that disconfirming data are less informative than affirming data, and 2) that disconfirming data are more likely to be faulty.  The first conjecture was found to be true in the usual case social scientists face, where Type 1 errors are fixed at a much lower rate than Type 2 errors.  The second conjecture, on the other hand, was found to be usually false, as the likelihood of error in disconfirming data depends mostly on whether one is generally good at choosing true hypotheses, which is unlikely when scientists are doing groundbreaking work.  Chapter Four explores human judgments of error in scientific data in hindsight and foresight.
Although the tendency to attribute disconfirming results to error was no greater in hindsight than foresight, these error attributions led to diffuse or uniform predictions for future data, and the greater the expected diffuseness of future data, the less likely participants were to see the data as worth sharing.  Chapter Five found that, although participants were generally poor at picking triples that fit the rule, they expected disconfirming feedback to be erroneous more often than affirming feedback, violating the normative analysis in Chapter Three.  Furthermore, participants shared disconfirming feedback less than affirming feedback, and this was strongly determined by their perception of fault in the data, rather than actual fault.  Finally, Chapter Six proposes three simple technologies, open-data, open-methods, and open-analyses, that allow for minimal conventions to be imposed on those receiving data, promoting the ethical data sharing policies outlined in Chapter Two.

\chapter{Future Directions}
This chapter discusses future directions that will follow the dissertation.  The overarching goal is to promote logical and mathematical analyses of data sharing problems, examine how humans actually analyze and share data, and develop methods that can help identify error in data and effectively communicate results.

\section{Normative}
Two open questions remain from the normative analysis: 1) how do we establish the conventions of members of the scientific community, and 2) what general data sharing policies can be developed if one wants to maximize the informativeness of the shared data?  The first part will explore the formal and practical implications of basing sharing policies on the explicit conventions of members of the scientific community, as well as create a basic structure, or ontology, such as OWL \cite{schneider2011reasoning}, so that ``minimal conventions'' can be precisely defined and standardized \cite{king2009automation}.  The second part will integrate mathematical analyses of information sharing and social learning from Chamley \cite{chamley2004rational} and Hirshleifer \cite{hirshleifer2003limited} to accommodate agents with bounded rationality in terms of confusion and attention.

\section{Descriptive}
To the maximum extent possible, future empirical research will be selected based on the ability to meet the following criteria:
\begin{enumerate}
\item The task should be a real problem, not a made-up one.
\item Solving the real problem should provide a real social good.
\item Solving the real problem should allow participants to learn.
\item Participants should be allowed to contribute to or co-author the final paper.
\end{enumerate}

\subsection{Surprises, Error, and Data Sharing}

\subsubsection{Concluding studies}
A final approach to examining the effect of hindsight on error attributions would be to use the more traditional hindsight paradigm, where participants are asked to `re-judge' their prior beliefs after receiving outcome knowledge.  Specifically, they would be asked ``how likely would you have been to identify this cause of the error?'' versus having them actually do so in foresight.

\subsubsection{Suppositional versus conditional causes and data}
Experiments Two through Four found that observed data are expected to be more likely to replicate than supposed data \cite{zhao2012updating}, but that the probabilities of the causes of supposed versus observed data do not change.  This line of research will focus on discovering why there is this asymmetry between supposing and observing data versus supposing and observing causes.

\subsubsection{Deciding on hypothesis generation versus evaluation}
Experiment Five found that more `natural' explanations that come to mind easily when designing an experiment are also seen as relatively more likely after observing the results compared to explanations that may have required more thought (and even empirical observation) to generate.  Deeper probing in foresight may help, by making sure all explanations that are serious possibilities are considered before observing the results.  Unfortunately, there is no limit to this time-consuming and often frustrating process, so its termination ultimately depends on a judgment that the so-far-considered explanations are `good enough'.  This line of research will look at how lay participants and scientists determine when hypothesis generation is good enough to cover all the serious explanations and possible errors of an experiment, and then engage in the experiment itself, rather than remaining in a purely reflective mode, as philosophers do.

\subsubsection{The problem of old evidence}
A third set of experiments will extend Experiment Five and examine the degree to which a hypothesis (either a core hypothesis or error model) not considered a priori can be believed after observing the data.  Given that Bayesian calculations do not apply in this situation, what psychological processes are involved in determining the posterior convincingness of hypotheses?  This is closely related to the problem of old evidence \cite{howson1989scientific} and belief revision \cite{suzuki2005old,chihara1987some}, and is a common problem faced by the FDA \cite{woodcock2005fda}:  
\begin{quote}
``Dr Schultz: I think the question related to new information from the outside. We have been in situations where new data (e.g., from other studies) have come to light as we were analyzing a specific study, which could influence the outcome either positively or negatively. This is a difficult problem, and I am not sure exactly how to deal with it. In particular, it is hard to suggest ignoring negative information that could impact our assessment of the safety of the product''.
\end{quote}
  
\subsubsection{Polanyi's wild goose chase}
A fourth set of experiments will examine how participants decide whether to pursue anomalies or continue on their research project; that is, whether and how they decide to engage in the `wild goose chase'.  This will also look at whether a `warm-up' period is used and whether this period is flexibly defined to selectively report data.

\subsection{Wason Rule Discovery Task}
\subsubsection{Networks and Communication}
Behavioral data sharing policies can be compared to rational analyses of social learning \cite{chamley2004rational} in simulated environments where multi-way communication is possible.  For example, participants in the Wason rule-discovery task can engage in rule discovery in parallel and complementary ways, sometimes even in direct competition.  They can also communicate their results back and forth to each other directly or through intermediaries (e.g., journals) that determine how the data will be disseminated and what rewards each researcher receives.

\subsubsection{Debugging}
An extension to the Wason paradigm that provides both greater external validity and practical application would be hypothesis testing and error identification in programming and debugging code.  Every program is an attempt to solve a problem, and thus embodies a hypothesis or conjecture.  When the code does not work, there are a number of ways it could have failed.  Importantly, a program could fail either because its approach does not actually solve the problem or because of a typo or bug in the code.  This is a genuine hypothesis testing problem with the possibility of error and with real consequences.  It can be used to solve real-world problems as well as to teach participants to program, a valuable skill.

\subsubsection{Computational Modeling}
Markov Decision Processes \cite{puterman1994markov} provide an important modeling extension to the Bayesian analyses proposed so far.  Computational models using Markov Decision Processes will be compared against actual human behavior.
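
For reference, the standard formulation is sketched here in generic notation; the mapping to the rule discovery task that follows is only a suggestion, not a committed design.  With states $s$, actions $a$, transition probabilities $P(s' \mid s, a)$, rewards $R(s, a)$, and a discount factor $\gamma \in [0, 1)$, the optimal value function satisfies the Bellman equation
\begin{equation*}
V^{*}(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big].
\end{equation*}
In the rule discovery task, one might take $s$ to be the participant's current beliefs about the rule and $a$ to be proposing a triple, attributing feedback to error, or announcing a final answer.  The policy that attains the maximum would then serve as the benchmark against which observed choices are compared.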

\subsubsection{Collaborative filtering and review}
A field experiment will test whether an open and collaborative publication system could work, similar to \href{http://arxiv.org}{ArXiv} or \href{http://stackexchange.com/}{Stack Exchange}, by getting error models from the general public.

\subsubsection{Crowdsourced discovery}
`Crowdsourced' science can be used to solve real-world social science problems \cite{von2006games}, possibly faster than an individual experimenter could, while simultaneously providing externally valid data on scientific reasoning and educating the public by engaging them in real science \cite{von2006games}.  The idea is to give MTurk participants a real scientific or social science problem and see what they do, following the process enumerated below:
\begin{enumerate}
\item I choose the problem.
\item The crowd comes up with the solution.
\item I conduct the actual study.
\item I give them feedback.
\item Repeat.
\end{enumerate}

To really study scientific reasoning, participants have to make their own instruments, develop real hypotheses, and get real feedback.  As most lay people could develop a survey to test a simple hypothesis, this task fits well.  Thus, by selecting a real social science survey research problem and having participants develop their own surveys to test it, one can get data on real scientific reasoning in a controlled manner.  It is also possible to compare the crowd to a graduate student trying to solve the same problem, to see who can come up with faster, better, or more cost-efficient solutions.  The approach also educates the public about science directly, providing another social good.

\section{Prescriptive}
I intend to write a book, \emph{Breeding Orchids}, that thoroughly discusses how benchwork is and should be done, focusing on pre-testing, pilot-testing, hypothesis formulation, and evidence synthesis.  The \emph{Breeding Orchids} approach will be applied to different research projects, including my own, as well as adapted to pharmaceutical drug research in the hopes of process improvement, as ``the drug discovery process involves blunders, wrong turns, false hypotheses as well as successes, but none of the failures are being reported or published in the journals and other sources'' \cite{kundoor2010uncovering}.  The meta-analytic approach will also be extended to include formal proofs, as well as tested for usability and practical application.  Application would ideally be on research involving cutting edge experiments, such as drug discovery involving a pharmaceutical company or academic researchers.

\section{Other Causes}
There are also a variety of other causes of the file-drawer problem besides the two investigated in this dissertation.  These are listed and briefly discussed below.

\subsection{Ad-hoc Methods}
One reason for the file-drawer problem in Psychology is that methods are rarely standardized or repeated, so there is much room for negative results.  Psychologists almost always develop their own paradigms, including materials and procedures, to test novel hypotheses.  The tinkering involved in this process involves many experiments that produce null results.  Take Michael Gorman's experience: 
\begin{quote}
  ``Behind virtually every published experiment is an extensive series of such pilot studies, where one tinkers with procedures and variables to find a promising combination. Such tinkering is never referred to in a published report, except perhaps as a footnote, but the most significant discoveries often occur when piloting'' (pg. 81-82, 87) \cite{gorman1992simulating}.
\end{quote}

He leaves us with the impression that experimental psychology produces vastly more experiments than are reported, and that these unreported experiments are inconsistent with the experimenter's theory.  As a result, one would expect that any experimental psychologist, if properly incentivized to be honest (for example, by being offered tenure), would probably tell us that they conduct anywhere between 3 and 10 times as many experiments as they report.  Unfortunately, empirically verifying this conjecture is very difficult.

\subsection{No Exact Replications}
Sterling \cite{sterling1959publication} argues that publication bias causes a perverse cycle, where non-significant results are not reported and, without knowing this, others repeat the failed experiment.  In contrast, replications do not occur for studies that get significant results.  Thus, the profession proceeds in a wasteful cycle of replicating experiments that test false hypotheses that we should have known were false if null results were properly documented, and not replicating tests of hypotheses that we believe are true because no one has a reason to do so.

\subsection{Over-Optimism}
Sometimes file-drawers emerge because an experiment seems easier than it actually is.  In the case of the simple experiment to demonstrate cold fusion by Pons and Fleischmann of the University of Utah, many poor experiments were conducted because ``many were taken in by the seeming ease of the experiment only to discover that a palladium electrolytic cell was a deal more complicated than expected'' \cite{collins1998golem}.

\subsection{Lack of Interest}
Reysen \cite{reysen2006publication} surveyed 237 faculty members in Psychology, finding that faculty members do not write up non-significant results because they see them as unpublishable, a waste of time, flawed, impossible to interpret, or useless.  Similarly, non-significant results are ``generally boring; it's difficult to get up the enthusiasm to write them up, and it's difficult to get them published in decent journals'' \cite{mcdonald2009handbook}.  The boring, uninteresting work of writing up negative results is seen as telling us ``nothing new or interesting'', and as laughable as ``a cover story in Nature trumpeting people Can't Fly!'' \cite{dunning}.  There also may be a generational component, as David Singer sees unwillingness to ``benefit from a nice piece of research that looks like it tells us nothing'' as an attitude that applies to older scientists, whereas ``young scientists are right to insist we start publishing negative results'' \cite{skloot2006publication}.  Media reports are more likely to pick up on ``interesting positive results showing cancerous effects of nuclear radiation than `uninteresting' negative results showing no effect'' \cite{koren1991bias}.  Timmer \emph{et al.} \cite{timmer2002publication} surveyed authors of abstracts to see whether they published their results subsequently.  Most failed to follow up on negative results because they found them uninteresting or too difficult to publish, or because they lacked time.  They did not find that studies with statistically significant results were more likely to be published than those that were not significant, but those with positive results did have a higher impact (in terms of citations) post-publication.  As in psychology, publishing data that suggest a medical treatment is ineffective is difficult.  For example, journal editors will argue that such evidence is not ``novel'' \cite{vergano2001filed}.
Failures to publish non-significant results are likely due to lack of motivation on the part of the researcher.  

\subsection{Low Power and Rare Discoveries}
John Ioannidis became interested in the problem of biased medical research and unwarranted conclusions when ``poring over medical journals, he was struck by how many findings of all types were refuted by later findings'' \cite{freedman2010lies}.  He uses a simple Bayesian formulation to demonstrate that, as long as the prior probability of discovering a truly effective medical treatment is low, statistical tests with very high sensitivity and specificity will still err more often when they lead one to conclude that a discovery has been made than when they lead to the conclusion of no discovery \cite{ioannidis2005most}.  Additionally, when sample sizes or effect sizes are small, or when there are flexible procedures and definitions, financial conflicts of interest, or competition, bias is likely to greatly increase the proportion of false discoveries deemed true.  Future directions could involve evaluating the rarity of true hypotheses and discoveries among those that are conjectured.
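
The Bayesian argument can be made concrete with a small worked example.  The symbols $\pi$, $s$, and $c$ are notation introduced here rather than Ioannidis's own, and the numbers are hypothetical.  Let $\pi$ be the prior probability that a tested hypothesis is true, $s$ the sensitivity of the test, and $c$ its specificity.  By Bayes' rule, the probabilities that a claimed discovery and a claimed non-discovery are mistaken are
\begin{equation*}
P(\text{false} \mid \text{discovery}) = \frac{(1-\pi)(1-c)}{\pi s + (1-\pi)(1-c)}, \qquad
P(\text{true} \mid \text{no discovery}) = \frac{\pi (1-s)}{\pi (1-s) + (1-\pi)\, c}.
\end{equation*}
With $\pi = 0.01$ and $s = c = 0.95$, the first is $0.0495/0.0590 \approx 0.84$ and the second is $0.0005/0.9410 \approx 0.0005$.  Under these assumed values, roughly 84\% of claimed discoveries would be false, whereas fewer than 0.1\% of claimed non-discoveries would be mistaken, even though the test itself errs only 5\% of the time in either direction.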

\subsection{Competition}
Hilary Spencer of the Nature publishing group remarked that researchers report not sharing their data with others because they want to maintain a ``competitive advantage in publication'' (pg. 3).  A recent anonymous editorial in Nature points out that a researcher may ``hoard her samples out of fear of competition; another doggedly promotes his hypothesis long after the data have falsified it; negative results are hidden because of competing financial interests'' \cite{Anonymous}.  Negative data may be a type of public good ``which helps other laboratories while not materially advancing their own reputations'' \cite{mccormick2007positive}.  Future directions involving competition between researchers in simulated environments could examine whether incentives in the form of competition lead to the file-drawer problem.

\section{Conclusion}

The file-drawer problem has been a concern for social scientists over the last fifty years, and seems to be getting worse \cite{fanelli2010positive,fanelli2012negative}.  However, a recent burst of activity has focused researchers, across a variety of disciplines, on this issue, with transparent approaches to documenting data \cite{king2007introduction}, statistical analyses \cite{stodden2009enabling}, and methods (\href{http://openwetware.org}{OpenWetWare}).  Hilary Spencer of the Nature publishing group expects journals to follow the lead of researchers when we clearly articulate our data sharing policies.  The journal \emph{Perspectives on Psychological Science} has a new special section dedicated to the file-drawer problem, thanks to Bobbie Spellman \cite{spellman2012introduction}.

The file-drawer problem is closely related to the dilemma of trying to understand why failed predictions occur, whether in replications of the experiments of others, replications of our own experiments, or new experiments.  As discussed in Chapter Two, determining whether theory or data are false when they conflict relies on convention.  However, Chapters Four and Five show that these conventions are determined by the perception of error, or a fallible `psychology of observation', which in turn is determined by the outcome of the experiment.  As Chapter Five shows, error attributions are more likely to be made, and data less likely to be shared, when the outcome of the experiment is disconfirming, even after accounting for actual error.

How can this knowledge of error perception and communication improve scientific methods and reporting?  First, we can acknowledge that we see ourselves as more like Millikan (potential Nobel laureates) than Blondlot (fooling ourselves), but \emph{behave} more like Blondlot than Millikan, as indicated by both error attributions and data sharing policies.  Thus, as open-access data advocates have argued, the norm of sharing all of our data should allow the ``story of the failures that make the successes possible'' (137, p. 15) to be told when our intuition would prescribe otherwise.  While this open approach helps others understand what we have done, it also helps make clear that the pattern of failed prediction and ad-hoc error attribution is an inevitable part of everyday scientific practice, regardless of the amount of pre-testing and pilot-testing we conduct, and thus education must include this process.  Although sometimes briefly discussed in research methods courses, the methodology, logic, and appropriate reporting of pre-tests and pilot-tests are not clearly defined.  I have not found a research methods textbook that covers pre-testing and pilot-testing.  This dissertation makes clear the need for educational programmes that include these practical elements of `benchwork' that are currently implicit and hidden.

The dissertation is an example of evidence-based methodological research, using logical and mathematical analysis, empirical observation and experimentation.  Hopefully, the recent willingness of psychologists and related social scientists to embrace serious methodological problems in their fields, such as the file-drawer problem, will make the solutions themselves an example of the better science that is needed.  In the dissertation I tried to do this.  The one spark that the file-drawer problem has ignited, both in myself and among other social scientists, is an affirmation that there is only one rule when doing science: \emph{not fooling ourselves}.  I was allowed to make this dissertation transparent, open, and skeptical, and this has kept my spark alive.  Hopefully I can do this for others.  I believe maintaining this spark will keep the field alive, and if it goes out, so too will the practical and ideological goals of social scientists be extinguished.

\bibliographystyle{ieeetr}
\bibliography{/home/alex/Dropbox/masterbib}
\end{document}

\subsection{Other Causes}
\subsubsection{Institutional}

Sometimes institutions can put a severe burden on negative results, causing them to be file-drawered.  For example, the relationship between official policy in psychology and negative results has changed over time.  The American Psychological Association changed its position on ``negative results'': in 1974 it placed a heavy ``burden on methodological precision'' on those who report negative results, whereas it later stated that ``Negative results should be accepted as such without an undue attempt to explain them away'' (Sommer, 1987; pg. 239).  Journals often encourage reviewers to scrutinize non-significant results much more severely than positive ones.  They will ``request additional analyses using different statistical approaches or additional experiments involving different frequencies, different modulation patterns or irradiation parameters, additional cell lines, in vivo systems or end points, longer or shorter exposure times, etc.'' while taking significant results at face value (Rockwell et al., 2006).

I do not discuss the content of data sharing in detail.  However, some advances in artificial intelligence and knowledge engineering may be instructive.  By using formal ontologies, it is possible to create automated scientific agents that test hypotheses and run experiments (King et al., 2009).  The ``meta-data'' they collect can be shared with other artificial agents that share the same language.  Although we are far from this goal, creating formal ontologies will help determine what content to share and how to represent it.  It also highlights the language problems inherent in scientific communication.

My solution to the signaling game is prospective publishing, where the protocol and justification for a study (what is usually the introduction and methods section) are accepted for publication before the experiment is conducted (Godlee, 2001).  If one takes the time to write up a low risk high quality methodology, where the methods are well understood, or a high risk but innovative experiment, where many nuisance variables may lurk in a new and poorly understood area, then it requires little more to complete it.  If this methodology is submitted before the data are collected, and the journal editors judge that the theory is sufficiently interesting, and methods of sufficient quality, that it could be published regardless of the outcome, then we have no perverse incentive for not publishing anomalous data.  All methodologically sound and interesting data are publishable, regardless of the outcome.  

\subsection{Frequentist probability}

The different mathematical approaches to decision based on data have different conceptions of probability.  The Frequentist approach, beginning with Jacob Bernoulli (1713) and Johan Bernoulli (1727), Venn (1888) and Galton (1888), axiomatized by Von Mises (1957), and applied by Ronald Fisher (1956), Jerzy Neyman and Egon Pearson (1928; 1933), and Deborah Mayo (1996), set out to make these decisions based only on observable, ``objective'' events.  This approach is called the frequency interpretation of probability.  

Von Mises (1957) laid the groundwork for the frequency interpretation of probability.  For Von Mises, probability did not apply to individuals or individual events, but only to well-defined collectives or reference classes: ``we must not think of an individual, but of a certain class as a whole, such as all insured men forty-one years old living in a given country and not engaged in certain dangerous occupations'' (Von Mises, 1957; pg. 11).  Again, he says a reference class or collective ``must exist before we speak of probability'' (pg. 12).  Thus probabilities apply only to collectives and predictions cannot be made for individuals.  They are specific statements about frequencies of observable events in a collective.

Before we continue, we must separate a statistical hypothesis from an explanatory theory.  An example of a statistical hypothesis is that the probability of getting a one on a six-sided die is 1 in 6.  An explanatory theory is that the center of gravity of the die determines the probability of each side.  The Frequentist statisticians for the most part limit their discussion to statistical hypotheses.  

\subsection{Neyman's inductive behavior and confidence intervals}

For Jerzy Neyman, a scientist is a ``crook who loaded the dice''.  The crook understands the possible ``long-run relative frequencies of events'', builds a hypothetical model of how to manipulate these frequencies, and uses this model to ``deduce rules of adjusting our actions (or `decisions') to the observations so as to ensure the highest measure of success'' (Neyman, 1977; pg. 99).  For him, this deduction is a `problem of mathematics' (pg. 100).

Science proceeds by making a model that guesses about the ``Frequentist consequences in situations not previously studied empirically'', eventually establishing enough control (or getting lucky) so that there is ``reasonable agreement'' between model and consequences, and ``one feels the satisfaction of having `understood' the phenomenon.''  This is, however, not the end, as ``invariably new empirical findings appear'' requiring the ``abandonment or modification'' of the original model (Neyman, 1977; pg. 101).

So far described, Neyman's approach is identical to Fisher's.  In fact, Neyman greatly admired Fisher when he began his initial work, although the relationship eventually devolved into bitter dispute.  

One reason for the dispute between Fisher and Neyman is that Neyman refused to admit that scientific reasoning or inference was the necessary goal of Frequentist statistics.  Instead, Neyman believed scientists only needed guides for inductive behavior (Lehmann, 1993).  That is, Neyman's approach was behavioristic, likely influenced by the concurrently popular psychological school of behaviorism associated with B.F. Skinner.  Neyman sought to provide ``rules to govern our behavior'' about when to make a decision to accept or reject a statistical hypothesis, rather than determine that an explanatory hypothesis is true or false.  He created a method for deciding whether the frequency outcomes of an experiment diverge from those predicted by a stochastic model.

Neyman proposes a string of heterogeneous situations representing hypothesis tests of H versus some set of alternatives.  One specifies the null hypothesis and a set of alternative hypotheses to it.  The decisions work as follows.  One judges that if the null hypothesis is true then ``action A would be preferable to B'' (Mayo, 1996; pg. 371).  On the other hand, if any of the alternative hypotheses are true, then ``action B would be preferable to A.''  An experimental outcome that rejects the null hypothesis would lead one to choose action B, with a known probability of type 1 error.  If the experimental outcome did not reject the null hypothesis, one would choose action A, with a known probability of type 2 error.

In each test, the hypotheses need not be the same.  His method, by appealing to the central limit theorem, guarantees that the relative frequency of type one errors will converge to the average of the alphas and the relative frequency of type two errors will converge to the average of the betas in the long run.  Thus, the theory describes the errors of inductive behavior.  He argues that these need not be ``repeated samples from the same population'' to have this property, and thus the rule of inductive behavior applies to human action generally, not just to one specific hypothesis or circumstance.
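
A minimal sketch of this long-run property follows.  It assumes independent tests and uses Chebyshev's inequality rather than the central limit theorem, so it is a simpler stand-in for Neyman's argument; the notation is introduced here for illustration.  Consider $n$ tests in which the null hypothesis is true, and let $E_i = 1$ if test $i$ commits a type one error and $E_i = 0$ otherwise, with $P(E_i = 1) = \alpha_i$.  Because each $E_i$ has variance $\alpha_i(1-\alpha_i) \le 1/4$,
\begin{equation*}
P\left( \left| \frac{1}{n}\sum_{i=1}^{n} E_i - \frac{1}{n}\sum_{i=1}^{n} \alpha_i \right| \ge \varepsilon \right) \le \frac{1}{4 n \varepsilon^{2}}
\end{equation*}
for any $\varepsilon > 0$.  For example, with $n = 1000$ tests and $\varepsilon = 0.05$, the bound is $1/(4 \cdot 1000 \cdot 0.0025) = 0.1$.  The bound does not require the hypotheses, or the individual $\alpha_i$, to be the same across tests.  The same reasoning applies to type two errors and the average of the betas.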

Neyman also provided the idea and the mathematical formulation of confidence intervals.  Neyman interpreted them, too, in terms of inductive behavior: when one asserts that a parameter of interest is contained in a confidence interval with confidence coefficient alpha, one guarantees that the ``relative frequency of correct assertions will be close to the selected alpha'' (Neyman, 1977; pg. 119).  This again appeals to the long-run consequences of using confidence intervals for inductive behavior for any hypotheses, all the same or all different (Neyman, 1937).
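
In symbols (the notation $L$, $U$, and $\theta$ is introduced here for illustration), an interval $[L(X), U(X)]$ computed from data $X$ has confidence coefficient $\alpha$ when
\begin{equation*}
P_{\theta}\bigl( L(X) \le \theta \le U(X) \bigr) = \alpha \quad \text{for every value of } \theta .
\end{equation*}
The probability statement concerns the random endpoints $L(X)$ and $U(X)$, not the fixed parameter $\theta$.  If such intervals are asserted in $n$ independent problems, which may concern entirely different parameters, Chebyshev's inequality implies that the proportion of correct assertions differs from $\alpha$ by $\varepsilon$ or more with probability at most $\alpha(1-\alpha)/(n\varepsilon^{2})$.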

\subsection{Pearson's powerful tests of alternative hypotheses}

Egon Pearson, on the other hand, was not concerned with inductive behavior, although he let Neyman formulate their research program in this light.  Pearson's main contribution was recognizing that an alternative hypothesis must be specified to have an optimal test of the null hypothesis (Pearson, 1955).  Pearson's approach chooses a null hypothesis and a set of alternative hypotheses and then finds the ``test to have maximum discriminating power within a certain class of hypotheses'' (Pearson, 1947; pg. 143).  He then proposes three steps: 1) define the ``experimental probability set'' or sample space for a repetition of experiments testing these hypotheses; 2) create boundaries such that ``as we pass across one boundary and proceed to the next'' we are ``more and more inclined, on the information available, to reject the hypothesis tested in favor of alternatives'' (this ordering is given by the likelihood ratio); and 3) calculate the probability of random sampling giving a result beyond any contour level specified in step 2 (the p-value or type 1 error).  There are parallels to acceptance sampling, but Pearson's approach preceded it.
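
For the simplest case of one null value $\theta_0$ and one alternative value $\theta_1$, the three steps can be sketched as follows.  The notation is an illustrative assumption and not Pearson's own.  Writing $f(x \mid \theta)$ for the probability density of the data $x$, the boundaries in step 2 are contours of the likelihood ratio
\begin{equation*}
\Lambda(x) = \frac{f(x \mid \theta_1)}{f(x \mid \theta_0)},
\end{equation*}
with larger values of $\Lambda(x)$ inclining us more strongly to reject the null hypothesis.  Step 3 then evaluates, for the observed data $x_{\mathrm{obs}}$,
\begin{equation*}
p = P\bigl( \Lambda(X) \ge \Lambda(x_{\mathrm{obs}}) \mid \theta = \theta_0 \bigr),
\end{equation*}
the probability under the null hypothesis of a result at or beyond the observed contour.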

For Pearson, hypothesis testing proceeds by the ``soundness of scientific judgment'' which a statistical test can only help make rigorous.  This will depend ``very little on whether his back-room calculations have been based on inverse or direct probability or on an appeal to Fiducial argument'' (Pearson, 1947; pg. 142).  In Pearson's pragmatic view, Frequentist probability is of high value, because ``in certain problems probability theory is of value because of its close relation to frequency of occurrence'' (Pearson, 1947; pg. 143).  Although Bayesian and Fiducial probability are possible, and even logical, they are not very useful.  This is a pragmatic argument.  Bayes is a good guide of action, but cannot inform action, whereas the opposite is true for the Frequentist approach.

Although many important problems are not repetitions, as Frequentist probability requires, it may be that ``hypothetical repetition helps to that clarity of view needed for sound judgment'' or that it ``should result in a long-run frequency of errors in judgment which we control at a low figure''.  However, diverging from Fisher and Neyman, Pearson does ``not care to dogmatize'' (Pearson, 1947; pg. 142).

Victoria Stodden's (2011) roundtable on data and code sharing discusses and proposes methods of documenting research in statistics, varying from meticulous version control of all documents related to a research project (Knepley, 2009), to ``Sweaving'' links to data and computational analysis into digital documents (Donoho, 2011).  

\subsection{Conventionalist Bayesian predictive checks}

Finally, a third and newly emerging group combines the conventionalist approach of the Neyman-Pearson Frequentists with the Subjectivist Bayesian approach (Gelman and Shalizi, 2011).  I call them conventionalist Bayesians.  They use the strengths of each approach to complement the weaknesses of the other.
The strengths of the Bayesian approach are embraced.  They allow modeling of unrepeated and unrepeatable observations, unobservable causes, and use prior knowledge for regularization.  They decide on a prior tentatively, allowing data to help justify it.  If the fit of a model from an assumed prior is poor, they are willing to throw out the prior.  

They also use a Frequentist decision-based philosophy to compensate for the limitations of the Bayesian approach, which requires logically omniscient priors or uncomputable functions (Danks and Eberhardt, 2009; Osherson, Stob, and Weinstein, 1988; Juhl, 1993).  If the cause of the anomaly in the data is not in the support of the prior, a Bayesian has little hope of discovering the cause, regardless of the amount of data obtained (Gelman and Shalizi, 2011).  So, ``thinking Bayesian'' is only provisionally helpful, affording us formal inductive rules only as long as we are willing to pretend that we are logically omniscient.

The key feature of this program is that priors are assumed, subjectively, but then the consequences of assuming that prior are checked, and the prior is revised if necessary.  Comparisons of the observed data with predictions derived from a model and prior are called posterior predictive checks.  If the assumed prior and models do not fit the data well, they go back and revise the prior and/or model to be more suitable.  For Gelman and Shalizi (2011; pg. 19) ``falsification is about plots and predictive checks, not about Bayes factors or posterior probabilities of candidate models.''
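
A standard textbook formulation of such a check is sketched below.  The notation is introduced here for illustration and is not quoted from the authors cited above.  Given observed data $y$, parameters $\theta$, and a posterior distribution $p(\theta \mid y)$ obtained from the assumed prior and model, replicated data $y^{\mathrm{rep}}$ are predicted by
\begin{equation*}
p(y^{\mathrm{rep}} \mid y) = \int p(y^{\mathrm{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta .
\end{equation*}
For a chosen test quantity $T$, the posterior predictive p-value is
\begin{equation*}
p_{B} = P\bigl( T(y^{\mathrm{rep}}) \ge T(y) \mid y \bigr).
\end{equation*}
A value of $p_{B}$ close to 0 or 1 indicates that the observed data are atypical of what the model and prior predict with respect to $T$, which is a signal to revise the prior, the model, or both.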

The Bayesians provide us with a way of representing our beliefs, including beliefs in hypotheses.  Using this mathematical formalization, it is possible to invert the process and discover what the source of anomalies in the data is, whether a faulty hypothesis or a faulty auxiliary.  The coherence that results is not a justification: ``an author using probabilities to express uncertainty must accept the burden of explaining to potential readers the considerations and reasons leading to the particular choices made.  The extent to which the author's conclusions are heeded is likely to depend on the persuasiveness of these arguments, and on the robustness of the conclusions to departures from the assumptions made'' (Kadane, 2011; pg. 6).

Bayesians can make sense of any data.  For Bayesians, there is no statistically significant threshold, type 1 error, or type 2 error; only belief and subjective consequences of decisions based on belief.  As a result, failing to share data creates a loss to the consumer of the data, which can be objectively evaluated once the subjective probabilities and utilities that are consequences of this data are known.  Although Frequentists balk when there is no possibility of repetition, when there are unobserved causes, and when sample sizes are small, Bayesians see all such data as valuable.  It makes no sense from the Frequentist point of view (Neyman, Pearson, and Fisher would agree) to conduct an experiment where one has close to zero probability of rejecting the null hypothesis (power close to zero).  However, Bayesians argue that data are almost never worthless in our decision-making, scientific or otherwise.

Their approach explicitly separates ``rejection and disproof'' (Lakatos, 1978; pg. 25).  One can reject data or a hypothesis by decision, or convention (i.e., it is convenient), but we cannot prove data or a hypothesis are false.  That is, the Frequentist tests are not designed to accept or reject the explanatory hypotheses we are interested in.  They are, instead, designed to accept or reject the statistical hypothesis that the data were likely under a given distribution determined by some data generating mechanism.  

The problem with the Frequentist hypothesis testing approach is this: we have data that are inconsistent with a statistical hypothesis that we choose to reject.  We do not have a guide to what went wrong with the rejected hypothesis, nor to how the statistical hypothesis, if rejected, relates to our explanatory theory.  We have no formal guidance.

We can use logic to try to pinpoint inconsistencies and faults, but these will be numerous.  Indeed, Falsificationists and Frequentists provide no solution to Duhem's problem of underdetermination, where one can never logically pinpoint the source of a failed prediction \cite{laudan1990demystifying}.  Mayo provides methodological guidance, but her solution is to evaluate error with no guidance on which errors to look at.  She merely instructs us to construct and examine an unprioritized list of possible errors.  Mayo's solution to the Duhem problem is that any auxiliary hypothesis that has not passed a severe test can be examined: ``Scientists do not succeed in justifying a claim that an anomaly is due to an auxiliary hypothesis by showing how their degrees of subjective belief brought them there.  Were they to attempt to do so, they undoubtedly would be told to go out and muster evidence for their claim, and in doing so, it is to non-Bayesian methods they would turn'' (pg. 109) \cite{mayo1996error}.  In contrast, the Bayesian method can suggest to us where the error may lie.

What scientists need is some formal way to assess potential faults, not only on their logical coherence, but also on how probable they are.  For example, in a psychological experiment, it is logically possible either that people didn't understand my instructions or that they weren't motivated, but one may seem more probable, and more worth investigating, than the other.  As Mayo says, one needs to carefully investigate error.  But one also needs a formal method to create a prioritized list of what errors to investigate first.
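
One way such a prioritized list could be produced is sketched below.  The sketch makes the simplifying assumption that the candidate faults $A_1, \dots, A_k$ are mutually exclusive and exhaustive, and all numbers are hypothetical.  Given anomalous data $D$, Bayes' theorem ranks the faults by
\begin{equation*}
P(A_i \mid D) = \frac{P(D \mid A_i)\, P(A_i)}{\sum_{j=1}^{k} P(D \mid A_j)\, P(A_j)} .
\end{equation*}
Suppose $A_1$ is ``participants misunderstood the instructions'' with $P(A_1) = 0.3$ and $P(D \mid A_1) = 0.8$, $A_2$ is ``participants were unmotivated'' with $P(A_2) = 0.1$ and $P(D \mid A_2) = 0.6$, and $A_3$ collects all other explanations with $P(A_3) = 0.6$ and $P(D \mid A_3) = 0.1$.  The numerators are $0.24$, $0.06$, and $0.06$, which sum to $0.36$, so the posterior probabilities are approximately $0.67$, $0.17$, and $0.17$.  Under these assumed values, misunderstanding of the instructions would be investigated first.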

\subsubsection{Managing uncertainty book}
\subsubsection{Communicating uncertainty}
\subsubsection{Meta-analysis}

The following seven principles should be kept in mind when developing software that implements the above prescriptive analysis.
\begin{enumerate}
\item Enhance field of vision
  \begin{enumerate}
  \item Always be expanding the set of explanations, variables, and hypotheses.
  \item Keep parsimony in mind.
  \end{enumerate}
\item Create maps, not points
  \begin{enumerate}
  \item Research is a map of results, not an individual result.
  \end{enumerate}
\item Know where you're going
  \begin{enumerate}
  \item Maximizing information gained
  \item Maximizing a treatment effect
  \item Maximizing hypothesis separation
  \end{enumerate}
\item Know where you've been
  \begin{enumerate}
  \item Keep accurate records
  \end{enumerate}
\item Know when to stop
\item Know where to start
\item Know what to think
  \begin{enumerate}
  \item Proper Bayesian updating
  \end{enumerate}
\end{enumerate}

Descriptive research. We anticipate that the project will highlight the need for empirical studies to inform our intuitions regarding barriers to effective inference or suggest new hypotheses worth exploring. This has been our research strategy under current support (see Section IX), which has included experiments asking new questions about two venerable studies. One direction in the new research is studies of individuals' ability to create, understand, and intervene on Bayesian networks with non-ideal interventions. Such research would build on studies such as Bruine de Bruin et al. (2009), Fischhoff et al. (2006), Kemp and Tenenbaum (2008), Steyvers et al. (2003), Lagnado and Sloman (2004), and Mansinghka, Kemp, Tenenbaum, and Griffiths (2004).  We will also conduct secondary analyses of existing studies, characterizing their methods and reporting practices in the terms of our framework. We are now completing such an analysis of field trials of methods for reducing consumer electricity consumption (Davis et al., 2011), finding both deficient reporting and, where one can tell, common biases in assignment to experimental conditions. Applying the current version of our framework suggests that they may have substantially overstated the effects of their interventions. However, these studies are largely in the gray literature of consultant reports; peer-reviewed publications might look quite different.

Descriptive research on how human learners handle non-ideal interventions

Stage 2 involves conducting experiments where auxiliary hypotheses, such as the experimental manipulations being ideal, may have been violated. However, descriptive research relevant to stage 2 is lacking. Studies of human ability to perform causal learning tasks have focused on ideal interventions (Gopnik, Glymour, Sobel, Schulz, Kushnir, Danks, 2004; Tenenbaum, Griffiths, and Kemp, 2006; Griffiths and Tenenbaum, 2005). For example, Steyvers, Tenenbaum, Wagenmakers, and Blum (2003) allowed participants ideal interventions to discover the communication network of alien mind readers. Lagnado and Sloman (2004) had participants learn a simple three variable chain using observations or ideal interventions of the relationship between variables such as acid levels, ester levels, and perfume, or pressure, temperature, and launch of a rocket.

Unfortunately, ideal interventions are rarely, if ever, possible in the real world. Scheines (2006) discusses ways interventions can go wrong or be fat-handed. A fat-handed intervention may directly change the value of a dependent variable along with the value of an independent variable, making it impossible to establish causality. For example, one may be interested in the effect of Tylenol on cold symptoms. The researcher randomly assigns Tylenol or placebo to a representative sample of people who have a cold and finds that symptoms are reduced in the Tylenol group compared to control. However, double-blinding was not used, and those who knew they had received Tylenol reported less severe symptoms. Here, the intervention (administration of Tylenol or control) affects not only the presence or absence of Tylenol in the two groups, but also directly affects the dependent variable of cold symptoms. This can be extended to cases where the targets of the intervention may not be known a priori, as is true in biomedical research (Eaton and Murphy, 2000).

We propose to conceptualize the task using relational probability models (Russell and Norvig, 2009). Here, the core theory has a Directed Acyclic Graph (DAG) structure given a configuration of auxiliary variables when they are all met. However, one generally does not know, and must discover, what the structure of the core theory is when one or more of the auxiliary hypotheses are not met. That is, uncertainty about these auxiliary variables leads to uncertainty about the structure of the core theory and clouds inference. Auxiliary hypotheses viewed this way are multiplexers (Russell and Norvig, 2009), where the truth or falsehood of the auxiliary hypotheses changes the structure of the core theory. An example of a multiplexer is given by Russell and Norvig (2009) in the context of book recommendations. If the person writing the book recommendation is honest, then the quality of the book is a cause (parent) of the recommendation. However, if the recommender is dishonest, then the quality is independent of the recommendation.
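
The book-recommendation multiplexer can be written out as a conditional distribution.  The symbols $R$, $Q$, $H$, $P_{\mathrm{h}}$, and $P_{\mathrm{d}}$ are introduced here for illustration and are not taken from the cited text.  Let $R$ be the recommendation, $Q$ the quality of the book, and $H$ an indicator that the recommender is honest.  Then
\begin{equation*}
P(R \mid Q, H) =
\begin{cases}
P_{\mathrm{h}}(R \mid Q) & \text{if } H = 1, \\
P_{\mathrm{d}}(R) & \text{if } H = 0,
\end{cases}
\end{equation*}
so the edge from $Q$ to $R$ is present only when $H = 1$.  A learner who is uncertain about $H$ must reason with the mixture
\begin{equation*}
P(R \mid Q) = P(H = 1)\, P_{\mathrm{h}}(R \mid Q) + P(H = 0)\, P_{\mathrm{d}}(R),
\end{equation*}
which assumes that $H$ is independent of $Q$.  The less probable honesty is, the weaker the apparent dependence of the recommendation on quality.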

Within this framework, we propose to conduct empirical research on how human learners use non-ideal interventions to learn causal structure. We seek to answer questions such as the following. Are people able to learn from fat-handed interventions, or do they form erroneous beliefs? Can learners discover the form of fat-handedness, such as affecting the DV and IV simultaneously or affecting a latent cause of the DV, and can they tell the difference between an ideal and a fat-handed intervention? Are humans capable of discovering an auxiliary hypothesis that works as a multiplexer?

There are other reasons, including institutional, statistical, psychological, competition, ad-hoc methods, no exact replication, and lack of interest, among others. 

\section{Changing how we think about Data Sharing}
Reprise Wason/Experimental surprises

Are non-significant results really useless, non-discoveries?  Some argue that ``When you do work that likely could have found something if it was there, and don't, that's a modest discovery. It should be part of the scientific record'' (Spurt, 2009).  Some say ``boring results are important: minimally, publication of a null result may save some hapless graduate student from spending three years trying to demonstrate an effect that's not there.'' (Bishop, 2011)
\section{Changing Institutions}
\subsection{Education}
Teach people breeding orchids

\subsection{Changing Journals}
``There must be more opportunities to present negative data. It should be the expectation that negative preclinical data will be presented at conferences and in publications. Preclinical investigators should be required to report all findings, regardless of the outcome. To facilitate this, funding agencies, reviewers and journal editors must agree that negative data can be just as informative as positive data.''

Any institutional changes will still depend on the honesty of researchers.  If blogs and wikis are used, the community still must trust each other to post blogs and wikis honestly.  The problem of data sharing will not go away.

High risks require decisive action, and the International Committee of Medical Journal Editors now requires clinical trials to be registered at the outset before publication (Shafer).  GlaxoSmithKline was successfully sued for suppressing negative results on the antidepressant Paxil (Economist, 2004).  Mandatory registration is likely necessary, as efforts to allow pharmaceutical companies to voluntarily opt in to trial registries failed (Economist, 2004).  However, even hot disciplines (e.g., clinical medicine, molecular biology), where attention has been paid to the file-drawer problem, have as yet shown no decline in ostensible data suppression (Vergano, 2011).  A strong ethical stance to require full reporting seems necessary; as Chalmers (1990) puts it, ``Failure to publish an adequate account of a well-designed clinical trial is a form of scientific misconduct that can lead those caring for patients to make inappropriate treatment decisions.''

\subsection{Changing Peer Review}

``Journals and grant reviewers must allow for the presentation of imperfect stories, and recognize and reward reproducible results, so that scientists feel less pressure to tell an impossibly perfect story to advance their careers.'' (Begley and Ellis, 2012)

``Journal editors must play an active part in initiating a cultural change. There must be mechanisms to report negative data that are accessible through PubMed or other search engines. There should be links to journal articles in which investigators have reported alternative findings to those in an initial (sometimes considered landmark) publication. One suggestion is to include `tags' that report whether the key findings of a seminal paper were confirmed.'' (Begley and Ellis, 2012)

In psychology, failed replications often go unpublished, as in the cases of Daryl Bem's work on ESP (\url{http://www.richardwiseman.com/BemReplications.shtml}) and Bargh's work on unconscious thought (\url{http://www.psychfiledrawer.org/}).  This has led to a strong desire for a formal independent verification system (\url{http://openscienceframework.org/project/shvrbV8uSkHewsfD4/wiki/index}).

The file-drawer problem was one of the main themes of Nobel Laureate Richard Feynman's famous lecture on Cargo Cult Science (1974).  For Feynman, science requires a ``kind of utter honesty'' where one must ``report everything that you think might make [an experiment] invalid---not only what you think is right about it\ldots details that could throw doubt on your interpretation must be given, if you know them\ldots If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it.''  This kind of integrity is what is missing ``to a large extent in much of the research in Cargo Cult Science.''  Feynman provides a famous example of the file-drawer problem, concerning Millikan's measurement of the charge of the electron: ``When they got a number that was too high above Millikan's, they thought something must be wrong---and they would look for and find a reason why something might be wrong.  When they got a number closer to Millikan's value they didn't look so hard.  And so they eliminated the numbers that were too far off'' (Feynman, 1974).  Feynman's prescription is to be determined to ``publish it whichever way it comes out.  If we only publish results of a certain kind, we can make the argument look good.  We must publish both kinds of results.''

A recent case in particle physics shows how anomalous results can be handled well.  OPERA physicists observed faster-than-light neutrinos.  In a demonstration of good behavior they shared the data, invited competitors in to critique their instruments, and were eventually able to trace the cause of the result to ``a cable that was not fully screwed in.''  They shared both their successes and flaws, without delay, and correction was quickly provided (Anonymous, 2012).

The \emph{Journal of Cerebral Blood Flow and Metabolism} has a negative results section (Dirnagl and Lauritzen, 2010).

Proliferation of journals of negative results: \url{http://www.jasnh.com/}; \url{http://www.jnr-eeb.org/index.php/jnr/index}; \url{http://www.jnrbm.com/}; \url{http://jinr.site.uottawa.ca/}; \url{http://www.arjournals.com/ojs/index.php?journal=Chem&page=index}; \url{http://jcrsci.org/}; \url{http://www.pnrjournal.com/}; \url{http://jspurc.org/intro2.htm}; \url{http://www.arjournals.com/ojs/}; The Journal of Interesting Negative Results; The Journal of Negative Results in Ecology and Evolutionary Biology; The Journal of Negative Results in Biomedicine; The Journal of Negative Results in Speech and Audio Sciences; The Journal of Pharmaceutical Negative Results; \url{http://www.psychfiledrawer.org/}; \url{http://sites.google.com/site/cujonr/}

\chapter{Data Sharing Policies}
When making moral or ethical decisions, sometimes we'd like to know ``what would Jesus do?'' (or substitute your favorite moralist/god/demigod).  Likewise, when we get anomalous data and need to decide whether to share them, we'd like to ask ``what would Popper do?'' or ``what would Fisher do?''  This chapter attempts to derive such rules.  I pretend to ask each person the following question:  I have collected data that disconfirm my hypothesis.  I do not understand the data.  Should I share these data with the scientific community?  When the philosophers and statisticians cannot speak for themselves, I try to speak for them.  The words are mine alone.

\section{What would Kuhn do?}
My first interpretation of Kuhn is anarchistic: the only rule is no rules.  He argues that scientists need no rules and will resist them, and that the government is hesitant to impose them on the scientific community.  Scientists will do what they do, sometimes failing, sometimes succeeding, but always self-correcting.  Thus, for data sharing and handling anomalies, Kuhn has no prescription: ``We therefore have to ask what it is that makes an anomaly seem worth concerted scrutiny, and to that question there is probably no fully general answer'' (pg. 82) \cite{kuhn1996structure}.

My second interpretation goes a little further.  Kuhn might argue that scientists should ignore anomalous data and not share them.  He says, ``the scientist who pauses to examine every anomaly he notes will seldom get significant work done'' (pg. 82).  Thus, if data are worth ``concerted scrutiny,'' we can be confident in sharing them with others.  If not, then we risk not getting ``significant work done'' (pg. 82).

[integrate results from dissertation]

\section{What would the Positivists do?}
I present Popper's straw man of two important Positivists, Carnap and Neurath, here.  Popper argues that early Carnap would share any data he collected because he equated basic records of visual sensation (``protocol sentences'') with true scientific statements, and held that there need not be more than this.  On the other hand, Neurath argues that sometimes we may want to delete protocol sentences rather than other sentences in our formal system; that is, Neurath allows that we should sometimes throw out the data.

[integrate results from dissertation]

\section{What would Popper do?}
A first take on Popper is that he didn't know.  Popper asks for ``a set of rules to limit the arbitrariness of `deleting' (or else `accepting') a protocol sentence.  For without such rules, empirical statements are no longer distinguished from any other sort of statements.  Every system becomes defensible if one is allowed (as everybody is, in Neurath's view) simply to `delete' a protocol sentence if it is inconvenient'' (pg. 78) \cite{popper2002logic}.  He felt that Carnap's proposal of requiring that all protocol sentences be true was not tenable, but that Neurath's position of allowing deletion of protocol sentences without clear guidance was not admissible either.

A second take on Popper would be that he would suggest only sharing data if they increase the falsifiability of our theory.  In Popper's falsificationism, we should not consider accepting or rejecting data themselves, but instead consider how the acceptance or rejection of the data affects the falsifiability of our system.  If they increase it, accept; if they decrease it, reject.  That is, Popper wants to avoid the conflict between data and hypothesis by looking at the consequences of accepting or rejecting data.  Popper says that if one wants to reject data, one must formulate a logical assertion about why the data should be rejected, and that this assertion must itself be falsifiable.

It is clear from Popper, however, that if one does not have a non-contradictory axiomatized theory, one cannot have falsifiers of the theory. Therefore, formal theory is necessary, but not sufficient, for sharing falsifying data of that theory.

[integrate results from dissertation]

\section{What would Lakatos do?}
Lakatos is pretty clear on what data should be shared: New facts that are empirically confirmed.  Lakatos finds refutations and anomalies useless, but a single confirmed novel prediction teaches us the most: ``exemplum docet, exempla obscurant'' (pg. 36) \cite{lakatos1980methodology}.  For Lakatos, the ``only relevant evidence is the evidence anticipated by a theory'' (pg. 38).  We may ``reject the facts as monsters'' when they disconfirm our expectations, because facts are merely interpretive theories to be compared against explanatory theories (pg. 45).  
[integrate results from dissertation]

\section{What would Fisher do?}
As mentioned in Chapter Two, Fisher is the only figure discussed here who clearly states what should be done with disconfirming or non-significant results: ``it is usual and convenient for experimenters to take 5 percent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard'' (pg. 1244) \cite{lehmann1993fisher,fisher1935design}, and ``If P is between 0.1 and 0.9, there is certainly no reason to suspect the hypothesis tested'' (pg. 1244).  This is what (most) psychologists do.  Although I risk overgeneralizing, this probably also applies to most social scientists, if not all scientists.  This highlights the influence of Fisher's ideas.  Because Fisher would not admit prior probabilities of theories, he ignores the fact, detailed in Chapter Three, that if one is generally poor at picking true hypotheses then statistically significant results can signal error more than non-significant results.  This was the case with Daryl Bem, who had a false hypothesis with statistically significant results.  Following Fisher's prescription could lead to the counterintuitive result, as described by Ioannidis \cite{ioannidis2005most}, where publishing only statistically significant results means publishing only errors.  The results of Chapter Five affirmed this: participants were more likely to choose triples that did not fit the rule but attributed error more to disconfirming feedback than to affirming feedback.
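
The role of the prior can be illustrated with Bayes' theorem.  The following is only a sketch: the prior probability $\pi$ that a tested hypothesis is true, the power $1-\beta$, and the significance level $\alpha$ are illustrative values assumed here, not estimates for any particular field.
\begin{equation*}
  P(H \text{ true} \mid \text{significant}) = \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)} = \frac{0.8 \times 0.05}{0.8 \times 0.05 + 0.05 \times 0.95} = \frac{0.04}{0.0875} \approx 0.46
\end{equation*}
Under these assumed values, fewer than half of the statistically significant results would correspond to true hypotheses, even though every one of them passed the 5 percent standard.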

\section{What would Neyman and Pearson do?}
The guess as to what Neyman and Pearson would do depends on Mayo's guidance.  Mayo portrays Pearson as disagreeing with any automatic rule of probability: ``Were the action taken to be decided automatically by the side of the 5\% level on which the observation point fell, it is clear that the method of analysis would here be of vital importance. But no responsible statistician, faced with an investigation of this character, would follow an automatic probability rule'' (pg. 386 as cited in \cite{mayo1996error}).  Furthermore, she finds in Neyman and Pearson \cite{neyman1928use} that whether one $P$ value falls above $0.05$ and another below should not matter substantively.

[integrate results from dissertation]

\section{What would Mayo do?}
For Mayo, all data have the potential to reveal error.  Because all learning is learning from error, all data should be shared.  Mayo especially condemns any strategy of reporting data that would misrepresent the error probabilities to the consumer, for example ``treating pre-designated and post-designated tests alike'' (pg. 296) \cite{mayo1996error}.

Mayo's objection to hunting for statistically significant correlations follows from her error probability principle (EPP): ``An NP procedure of inference is inadmissible if its error probability characteristics are inconsistently reported or if it prevents the determination of valid error probabilities'' (pg. 297).  Thus, for Mayo, any procedure that invalidates error probabilities, which failing to share data arguably does, is inadmissible.  Thus, Mayo would recommend sharing any data as long as the error probabilities of those data are accurately captured. 

Mayo also agrees with Rosenthal's fail-safe N: ``Robert Rosenthal, a leader in the relatively recent area of meta-analysis, discusses how one might estimate the degree of damage to any research conclusion that could be done by the file-drawer problem (1987; pg. 223). This attempt to estimate and subtract out the effect of studies remaining in file drawers is, in its intent, very much in the spirit of the error statistical program'' (pg. 313).

[integrate results from dissertation]

\section{What would the Bayesians do?}
I can't imagine any Bayesians arguing for not sharing data.  The only case when data are unambiguously uninformative is when they are equally likely under all hypotheses.  This is not sufficient for data to be uninformative to another person.  The Bayesian must also believe absolutely that another person also believes the data are equally likely under all hypotheses the other person considers. 
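
The claim that such data are uninformative follows directly from Bayes' theorem.  As a brief sketch, let $c$ (notation introduced here for illustration) denote the common likelihood, so that $P(d|H_{j})=c$ for every hypothesis $H_{j}$ under consideration.  Then, for any hypothesis $H_{i}$:
\begin{equation*}
  P(H_{i}|d) = \frac{P(d|H_{i})P(H_{i})}{\sum_{j}P(d|H_{j})P(H_{j})} = \frac{c\,P(H_{i})}{c\,\sum_{j}P(H_{j})} = P(H_{i})
\end{equation*}
The posterior equals the prior only because $c$ is the same for every hypothesis in the sum; a person who entertains even one additional hypothesis with a different likelihood would update on the same datum.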

[integrate results from dissertation]

\section{Pedagogical Data Sharing}

The previous two sections evaluated two conjectures that justify not sharing data: the differential diagnosticity conjecture and the blaming the method conjecture.  This section evaluates policies that a helpful rational agent should follow when sharing data.  I consider two agents, the teacher who is sharing data, and the learner who is receiving the data.

One option is for the teacher to share all of the data.  Lemma 1 from Shalizi and Crutchfield \cite{shalizi2001computational} says that one can do no better in pattern prediction than including all previous histories of a stochastic process (all data).  This is called complete sampling.  Although complete sampling is assured to be maximally informative, it can also be unnecessarily complex.  If the teacher can figure out how to share only the most helpful instances with the learner, that is, the smallest data set that produces the maximum predictive power (equivalent to the predictive power the learner would have if she accessed all of the data), then it is likely that the learner would perform better than in the complete sampling case.  For example, if one can identify the causal states of a system \cite{shalizi2001computational}, then one can share less than all of the data and the learner can perform just as well.  This is called \emph{pedagogical sampling} \cite{shafto2008teaching,shaftolearning,shaftoepistemic}.

Pedagogical sampling involves a teacher choosing samples to give to the learner, and the learner inferring concepts from both the samples given and knowledge of the pedagogical sampling strategy \cite{shafto2008teaching}.  That is, if the learner knows the teacher is rational and well-intentioned, providing the most informative samples, then induction is greatly facilitated.

In general, any pedagogical data sharing strategy depends on the following assumptions: 
\begin{enumerate}
\item That the teacher knows the hypotheses considered by the learner.
\item That the teacher has some idea about the prior probabilities the learner assigns to hypotheses.
\item That the teacher and learner reasonably agree on the meaning of the data.
\item That the teacher and learner agree that the teacher is sharing using a pedagogical strategy.
\item That the teacher has some knowledge of how much data the learner can attend to without tuning out or becoming confused.  
\end{enumerate}

All of these are likely to be false in the real world.  In the general case, the teacher knows very little about the hypotheses the learner considers, nor how the learner will interpret any fixed dataset shared with her.  Thus, I see no general solution to the problem of pedagogical data sharing.  If one shares all the data and the learner has perfect attention and doesn't get confused, one can perform no better, and this is assured.  Thus, the strategy that minimizes the chance of the worst outcome (the minimax regret strategy) is to share all of the data; there are no data points that are unambiguously worthless.  

In the following analyses, I make one or more of the assumptions 1-5.  Consequently, the conclusions should be at least as tentative as the truth of these assumptions.

First, take the Wason 2-4-6 rule discovery task as an example.  Suppose an oracle tells the teacher the true hypothesis (the true hypothesis in our Wason task was ascending, consecutive, evens, from 2 to 100), and that feedback is deterministic.  The teacher can collect data (propose triples) and receive feedback on whether each triple fits the rule or does not, and use this feedback to convince the learner of this hypothesis by sharing the data that she sees as most convincing.  What data will the teacher collect and share?

Shafto, Goodman and Frank \cite{shaftolearning} describe how the learner can use the perceived goals and intentions of the teacher to make better inferences.  If the learner assumes pedagogical sampling, this entails that the teacher will provide demonstrations that give the maximum amount of evidence while also eliminating alternative hypotheses if possible.  Thus, if the teacher knows the true hypothesis and the learner makes the pedagogical sampling assumption, the teacher can convey the true rule with only a fixed set of datapoints, although this set is not unique.  This is the minimally sufficient set to demonstrate the hypothesis (the set of causal states from Shalizi and Crutchfield \cite{shalizi2001computational}).  If feedback is deterministic, then demonstrating joint sufficiency of the five elements of the Actual Rule in the Wason task ($A=\text{ascending}$, $B=\text{consecutive}$, $C=\text{evens}$, $D=\text{greater than 2}$, $E=\text{less than 100}$) can be done by showing that when all five elements are present, the feedback indicates that the triple fits the rule (FIT).  That is, one must demonstrate the implication $(A \land B \land C \land D \land E \rightarrow \text{FIT})$ by modus ponens.  This requires only one trial.

To demonstrate that the elements are not \emph{n}-wise sufficient (that is, no combination of the five elements other than all five together is sufficient) but that all are necessary, the following must be done.  One demonstrates joint necessity by showing that the absence of any of the five constituents leads to the triple not fitting the rule; that is, the implication $(\neg A \vee \neg B \vee \neg C \vee \neg D \vee \neg E \rightarrow \neg \text{FIT})$ holds, so that a fitting triple implies, by modus tollens, that all five elements are present.  By removing elements in 1-way, 2-way, 3-way, 4-way, and 5-way combinations, one can demonstrate that all 5 elements are necessary and that no proper subset of the 5 elements is jointly sufficient.  The entire process of removal can be done in $\binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + \binom{5}{5} = 31$ trials.
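
As a check on this count, the individual binomial terms can be written out.  This is only arithmetic; the closing remark about the positive trial is one reading of the procedure rather than an additional result.
\begin{equation*}
  \binom{5}{1} + \binom{5}{2} + \binom{5}{3} + \binom{5}{4} + \binom{5}{5} = 5 + 10 + 10 + 5 + 1 = 31 = 2^{5} - 1
\end{equation*}
The form $2^{5}-1$ reflects that every non-empty set of elements is removed exactly once.  If the single trial in which all five elements are present is counted as well, the demonstration covers all $2^{5}=32$ presence/absence patterns of the five elements.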

Unfortunately, any such demonstration (and there are an infinite number of them) is also consistent with an infinite number of alternative hypotheses.  Thus, if the teacher is told the true rule by the oracle and the teacher and learner have common knowledge of pedagogical sampling, then the teacher can convey the correct rule in 31 trials, although the rule conveyed will not be unique.  For the learner to have any chance of guessing correctly, she must have prior knowledge that also shifts her judgment toward these 5 elements and not others (e.g., the sequence must begin with a number with the letter \emph{t} in it).

Now consider a more restrictive set of assumptions.  Suppose that the teacher and learner agree on the logical relationship between data and each potential hypothesis.  That is, $P(D|H)_{t}=P(D|H)_{l}$.  Although they don't necessarily have the same prior beliefs about hypotheses, they evaluate each data point in the same way.  Also suppose that the hypothesis spaces for both learner and teacher have the same support and both include the true hypothesis; that is, they both assign non-zero probability to the same hypotheses and non-zero probability to the true hypothesis.   

What data may the oracle-informed teacher be able to exclude in this case?  After a sequence of \emph{n} trials collecting data $\{d_{1},d_{2},\ldots,d_{n}\} \subseteq D$, the teacher considers her own posterior beliefs $P(H|D)_{t}$ about each hypothesis.  The teacher wants to share some subset of the data such that the learner's posterior beliefs are as close to the teacher's as possible.  A data point $d_{i}$ is completely useless if it does not change $P(H|D)_{t}$ for any combination of subsets of the data.  That is, if one were to evaluate one's posterior beliefs for every combination of subsets of the data, and a specific data point $d_{i}$ has no effect on $P(H|D)$ upon inclusion or exclusion in any of those subsets, then $d_{i}$ is not informative.  Since it is not informative for the teacher, it will also not be informative for the learner because they have common likelihoods $P(D|H)_{t}=P(D|H)_{l}$.  I call such a datapoint \emph{unambiguously excludable}.

A less severe principle for exclusion would be \emph{redundancy}.  This criterion means that $d_{i}$ is exactly the same (in the sense that $P(d_{i}|H_{j})= P(d_{k}|H_{j})$ for all hypotheses $j$) as at least one $d_{k}$.  That is, one would have the exact same posterior beliefs, across all hypotheses, if one substituted $d_{k}$ and excluded $d_{i}$ from the set of data considered.  In this case, $d_{i}$ and $d_{k}$ are redundant.  If this is true, then either $d_{i}$ or $d_{k}$ can be excluded from the shared data without any fear of harming the learner's performance.  However, if feedback is stochastic, then this won't hold, because any data point will increase the probability of some hypothesis, even in the presence of redundant data, for the same reason that observing two heads indicates a bias toward heads in a binomial parameter more strongly than observing one head.
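
The coin analogy can be made concrete with a small worked example.  The two hypotheses and their numerical values are illustrative assumptions, not quantities from the Wason task: a fair coin $H_{f}$ with $P(\text{heads})=0.5$, a biased coin $H_{b}$ with $P(\text{heads})=0.8$, and equal priors of $0.5$.
\begin{align*}
  P(H_{b}|\text{one head}) &= \frac{0.8 \times 0.5}{0.8 \times 0.5 + 0.5 \times 0.5} = \frac{0.40}{0.65} \approx 0.615 \\
  P(H_{b}|\text{two heads}) &= \frac{0.8^{2} \times 0.5}{0.8^{2} \times 0.5 + 0.5^{2} \times 0.5} = \frac{0.32}{0.445} \approx 0.719
\end{align*}
The second head has exactly the same likelihood under each hypothesis as the first, so it is redundant in the sense above, yet it still moves the posterior.  Under stochastic feedback, dropping it would therefore leave the learner with weaker evidence.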

Next, take the case more similar to the real world, where there is no oracle to tell the teacher the truth.  There is no unique set of rational strategies the teacher can follow to discover the rule if it is not provided by an oracle (although Austerweil and Griffiths \cite{austerweil2008rational} make an attempt under severely limiting assumptions about the hypothesis space); all strategies are admissible.  However, the teacher can rule out some hypotheses out of an infinite set.  At the end of the data collection phase, assume that the teacher has some finite set of hypotheses she considers possible (although the logically possible set is infinite), and a probability distribution representing the teacher's belief that each hypothesis is the true hypothesis, $P(H|D)$.

The most plausible set of restrictions on the data sharing problem is the following.  Suppose some datapoints would rule out a hypothesis, and one correctly believes the learner believes this hypothesis has low probability.  That is, the teacher's beliefs are correct, up to ordinal validity, on the sets of $P(D|H)_{l}$ and $P(H)_{l}$, for all data and hypotheses.  Since the learner must guess after receiving the data, it may not be worthwhile to rule out a low probability hypothesis for the learner.  That is, it may not be worthwhile to share data that would rule out a hypothesis that the learner wouldn't guess anyway.  

For this to matter one must also assume that attention is costly in the sense that attending to one data point reduces the probability of attending to other datapoints.  If this weren't the case, then sharing data that rule out a hypothesis that the learner wouldn't guess is harmless.  When attention matters, sharing a datapoint that refutes a low probability hypothesis may preclude the learner from considering another shared datapoint that refutes a high probability hypothesis.

Suppose the shared datapoints are exchangeable (i.e., there is no way of drawing attention to some datapoints over others once they are shared).  Call the attentiveness of the learner $\alpha \sim \mathrm{Bernoulli}(\eta)$, where if only one datapoint is shared, the probability of the learner attending to it is $\eta$. 

Suppose there are two datapoints: $d_{1}$ and $d_{2}$.  If only one datapoint $d$ (either $d_{1}$ or $d_{2}$) is shared $(N=1)$, then it is incorporated into the learner's posterior distribution with probability $\eta$; otherwise the learner keeps her prior.  That is:

\begin{equation}
  E[P(H|d,N=1)]=\eta \left[\frac{P(d|H)P(H)}{P(d|H)P(H)+P(d|\neg H)P(\neg H)}\right] + (1-\eta)P(H)
\end{equation}

If both datapoints are shared ($N=2$), then the following relationship holds:

\begin{equation}
  \begin{split}
    E[P(H|d_{1},d_{2},N=2)] = P(H)\Bigg[&\frac{\eta^{2}P(d_{1},d_{2}|H)}{P(d_{1},d_{2})}+\frac{\eta(1-\eta)P(d_{1}|H)}{P(d_{1})} \\
      &+\frac{\eta(1-\eta)P(d_{2}|H)}{P(d_{2})}+(1-\eta)^{2}\Bigg]
  \end{split}
\end{equation}

According to this equation, when sharing two datapoints, both are attended to with probability $\eta^{2}$, each one alone is attended to with probability $\eta(1-\eta)$, and neither is attended to with probability $(1-\eta)^{2}$.
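
To see how the one-datapoint and two-datapoint expressions compare, consider an illustrative case.  All numbers are assumptions chosen for easy arithmetic, and the two datapoints are assumed to be conditionally independent given $H$: let $P(H)=0.5$, $P(d_{i}|H)=0.8$, $P(d_{i}|\neg H)=0.4$ for $i=1,2$, and $\eta=0.5$.  Then $P(d_{i})=0.6$ and $P(d_{1},d_{2})=0.5 \times 0.64 + 0.5 \times 0.16 = 0.40$, so that
\begin{align*}
  E[P(H|d,N=1)] &= 0.5 \times \frac{0.8 \times 0.5}{0.6} + 0.5 \times 0.5 \approx 0.583 \\
  E[P(H|d_{1},d_{2},N=2)] &= 0.5\left[0.25 \times \frac{0.64}{0.40} + 0.25 \times \frac{0.8}{0.6} + 0.25 \times \frac{0.8}{0.6} + 0.25\right] \approx 0.658
\end{align*}
This comparison holds $\eta$ fixed.  If $\eta$ falls as more datapoints are shared, as the assumption of costly attention suggests, the ordering can reverse; for instance, with $\eta=0.2$ in the two-datapoint case the same calculation gives approximately $0.565$.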

In general, if $N=k$ datapoints $D=\{d_{1},d_{2},\ldots,d_{k}\}$ are shared, and $D \setminus \{d\}$ denotes the set $D$ with element $d$ excluded, then the following relationship holds:

\begin{equation}
  \begin{split}
    E[P(H|D,N=k)] = P(H)\Bigg[&\eta^{k}\frac{P(D|H)}{P(D)} \\
      &+ \eta^{k-1}(1-\eta)\sum_{i=1}^{k}\frac{P(D\setminus\{d_{i}\}|H)}{P(D\setminus\{d_{i}\})} \\
      &+ \eta^{k-2}(1-\eta)^{2}\sum_{1 \le i<j \le k}\frac{P(D\setminus\{d_{i},d_{j}\}|H)}{P(D\setminus\{d_{i},d_{j}\})} \\
      &+ \eta^{k-3}(1-\eta)^{3}\sum_{1 \le i<j<l \le k}\frac{P(D\setminus\{d_{i},d_{j},d_{l}\}|H)}{P(D\setminus\{d_{i},d_{j},d_{l}\})} \\
      &+ \cdots \\
      &+ \eta(1-\eta)^{k-1}\sum_{i=1}^{k}\frac{P(\{d_{i}\}|H)}{P(\{d_{i}\})} + (1-\eta)^{k}\Bigg]
  \end{split}
\end{equation}
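
The expansion can also be written compactly as a sum over the subsets of shared datapoints that the learner happens to attend to.  This is a sketch under the assumption, implicit in the $N=2$ case, that each shared datapoint is attended to independently with probability $\eta$; the symbol $S$ for the attended subset is notation introduced here for illustration.
\begin{equation*}
  E[P(H|D,N=k)] = P(H)\sum_{S \subseteq D}\eta^{|S|}(1-\eta)^{k-|S|}\,\frac{P(S|H)}{P(S)}, \qquad \frac{P(\emptyset|H)}{P(\emptyset)} \equiv 1
\end{equation*}
For $k=2$ the four subsets $\emptyset$, $\{d_{1}\}$, $\{d_{2}\}$, and $\{d_{1},d_{2}\}$ receive weights $(1-\eta)^{2}$, $\eta(1-\eta)$, $\eta(1-\eta)$, and $\eta^{2}$, which recovers the $N=2$ expression.  The weights sum to one by the binomial theorem, $\sum_{m=0}^{k}\binom{k}{m}\eta^{m}(1-\eta)^{k-m}=1$, so each unordered subset must be counted exactly once.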

In Chapter 3, I found the development of equations (3.22) and (3.23) to be quite opaque. Three issues:

\begin{enumerate}
\item Shouldn't the teacher care about minimizing the difference between her maximal probability hypothesis given all data and the learner's given only $k$ data? The teacher wants the learner to believe what she actually believes, not what she would believe if she were shown the data that will be shown to the learner. So, why is $k$ in both posteriors?
\item Shouldn't those equations use the expected distribution for the learner? The setup implies that these equations are derived for the case in which the learner does not attend to all data (and if the learner does not face attentional bounds, then why go through the combinatorics leading up to (3.21)?). And in that case the expected distribution is what matters, since there is no determinate ``actual'' one.
\item If the learner just gets to make one guess, then why does the difference between posteriors matter? The teacher should be happy with any data that lead to the correct guess, regardless of the particular posterior that the learner has.
\end{enumerate}

On p. 45, you wrote: ``If ND is true, the editor gets payoff x.'' Given that this is essentially a signaling game, shouldn't the payoff in the ND case be 0 (so the editor has an incentive not to make any payment at all)?

Suppose the teacher, feeling more informed than the learner, finds the following function suitable for evaluating data sharing, where the value of sharing $k$ of $N$ elements of data $D$ is:

\begin{equation}
  V(D,N=k)=-\|P(H|D,N=k)_{l}-P(H|D,N=k)_{t}\|_{1}
\end{equation}

Since the teacher knows the learner must make one guess, the highest posterior probability hypothesis, this function supposes that the teacher does not like any absolute difference in probability between her highest posterior probability hypothesis and the learner's belief in the teacher's highest posterior probability hypothesis after sharing $k$ data points.  Suppose the teacher chooses a dataset $D$ of size $k<N$ to share.  The teacher then wants to choose the value of $k$ so as to minimize the difference between the learner's posterior beliefs and the teacher's:

\begin{equation}
  \arg\max_{k}[V(D,N=k)]=\arg\max_{k}[-\|P(H|D,N=k)_{l}-P(H|D,N=k)_{t}\|_{1}]
\end{equation}

One can thus go through each data set of size $k$ and evaluate $V(D,N=k)$ for each one looking for the maximum.  If this maximum is unique, then that data set is the only admissible one.
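
A small numerical illustration of this selection rule follows.  The three hypotheses, the two candidate datasets $D_{a}$ and $D_{b}$, and all probabilities are hypothetical values chosen only to show the computation; for simplicity the teacher's posterior is taken to be the same for both candidates.  Suppose the teacher's posterior over $(H_{1},H_{2},H_{3})$ is $(0.7,0.2,0.1)$, sharing $D_{a}$ would leave the learner with $(0.5,0.3,0.2)$, and sharing $D_{b}$ would leave the learner with $(0.6,0.3,0.1)$.  Then
\begin{align*}
  V(D_{a}) &= -\left(|0.5-0.7| + |0.3-0.2| + |0.2-0.1|\right) = -0.4 \\
  V(D_{b}) &= -\left(|0.6-0.7| + |0.3-0.2| + |0.1-0.1|\right) = -0.2
\end{align*}
Since $V(D_{b})>V(D_{a})$, the teacher would share $D_{b}$ under this criterion.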

Attention is not the only problem with sharing too much data.  Confusion is also likely to arise if: 1) the results of two datapoints are contradictory for some hypothesis, 2) the logical relationships are unclear (e.g., a double negative), or 3) some datapoint has no clear hypothesis to which it is related (i.e., it is only related to low probability hypotheses).  It is possible to model confusion in the following way.  Suppose confusion only affects the probability of the learner continuing to look at data points: $\eta$.

First, $\eta$ will depend on the coherence or contradiction in the data, that is, the degree to which two datapoints suggest different probabilities under the same hypothesis.  More formally, the coherence of the shared data is determined by the degree to which $P(D|H)$ is the same for all data for all hypotheses.  If two datapoints give contradictory indications of a hypothesis, that is, one datapoint suggests a hypothesis is true whereas another suggests it is false, coherence is reduced.  Thus, one would not want to share trials one thinks are errors because they will likely reduce the coherence of one's dataset.  Since $\eta$ increases with coherence, sharing errors would decrease coherence and increase the chance that the learner will get confused and stop looking at the rest of the data.

Second, suppose a hypothesis exists where $P(H)>0.5$.  Suppose the learner represents this as $H$, rather than $\neg H$.  If this is true, then $H^{-}$ tests for $H$ which get disconfirmation (DNF) will support $H$.  However, the learner must use a double negative to determine this; that is, an $H^{-}$ test means $H \rightarrow DNF$.  In other words, \emph{doing modus tollens on a negation is hard}.

Finally, if one provides a lot of data that rule out low probability hypotheses, or hypotheses that the learner does not consider plausible, then the learner is likely to get confused and stop looking at the data.

In sum, if one is willing to admit that the person sharing the data (the teacher) knows what the learner believes, and the teacher believes she is better informed than the learner, then the teacher can try to find a subset of the data she collected that maximizes the chance of the learner guessing the correct hypothesis.  If attention and confusion are taken into account, this will be the smallest subset of collected data that assigns maximum probability to the best hypothesis the teacher considers, is not contradictory (is coherent), does not have a lot of double negatives, and clearly relates to some reasonable hypothesis.  This pedagogical data sharing strategy, by taking these constraints into account, also resembles one that is adversarial, where the teacher tries to convince the learner of the \emph{incorrect} hypothesis.

\section{Adversarial Data Sharing}

Unfortunately, sharing data is not a purely pedagogical endeavor.  There are often incentives to convince others that one's hypothesis is correct even if one has private evidence that weakens, or falsifies, the proposed claim.  The person with this data can choose to share or conceal it.  The cost of sharing unconvincing data is that the data sharer's payoff is likely lowered in terms of fame, prestige, tenure, etc.

If there are incentives to share or not share data, then the data sharing activity can be seen as a game between the researcher and scientific community, where the researcher attempts to gain the maximum benefit, in terms of prestige or wages, with the smallest effort, and the scientific community attempts to discriminate between researchers who are proposing true and false hypotheses.  This is sometimes called a signaling game \cite{akerlof1970market,spence1973job} or principal-agent game \cite{fudenberg1991game}.

There are two players in the game, the sender (i.e., researcher/teacher) and the receiver (i.e., editor or scientific community/learner) \cite{fudenberg1991game}.  The sender perfectly understands what the receiver knows and the incentives for the receiver; this is called \emph{common knowledge}.  This section uses the terms researcher and editor, for contextual purposes, although they are equivalent to teacher and learner, respectively.

A researcher collects data that either lead to a \emph{true discovery} ($TD$) or \emph{no discovery} ($ND$).  In this world, the researcher does not know whether she has made a true discovery, because affirming evidence cannot conclusively prove a hypothesis.  The editor wants to pay researchers for true discoveries and pay nothing to those with no discoveries.  However, the editor does not know whether the researcher has made a true discovery or no discovery, but instead only has access to a dataset shared by the researcher and prior knowledge about the probability of the researcher's hypothesis.

As a result, the editor must make a bet on the researcher's hypothesis being a true discovery or no discovery.  Suppose the editor has a total budget of $\mathcal{A}$ (e.g., journal space; reputation of the journal; effect of the researcher's paper on the impact factor of the journal).  The editor bets $\mathcal{A}-x$ that the researcher has a true discovery and $x$ that the researcher has no discovery.  If TD is true, the editor gets payoff $\mathcal{A}$.  If ND is true, the editor gets payoff $x$.  If one scales the bet on a true discovery by the total budget, then $P(H=TD|D)_{E}=\frac{\mathcal{A}-x}{\mathcal{A}}$ is the editor's posterior probability that hypothesis $H$ is a True Discovery given dataset $D$, that is, the editor's degree of belief that the researcher is correct.  The researcher's payoff, on the other hand, depends only on $\mathcal{A}-x$.  Thus, the researcher has an incentive to share data that convince the editor of a believable but false hypothesis.

Suppose the researcher collected a total dataset $D$.  Call the set of every possible combination of data the researcher could share the powerset of the data, $\mathcal{P}(D)$, which has $2^{|D|}$ elements.  The rational and self-interested researcher looks at every element of $\mathcal{P}(D)$ and chooses a hypothesis $H$ that has maximal probability in $\mathcal{P}(D)$; call this $H^{max\mathcal{P}(D)}$.  This maximal probability will be determined both by the researcher's probability of the data for each hypothesis, $P(D|H)$, and the prior probability the researcher assigns to the hypothesis, $P(H)$.  $P(H)$ helps reduce the set of possible hypotheses to those that are plausible, and likely to be considered plausible by both researcher and editor.  Assume for simplicity that the researcher believes the editor holds similar beliefs, although these two values could be substituted with the researcher's beliefs about the editor's beliefs.

Call the dataset that provides maximal probability from $\mathcal{P}(D)$ for $H^{max\mathcal{P}(D)}$, $d^{max\mathcal{P}(D)}$.  This is the \emph{most convincing dataset} ($MCD$) from $\mathcal{P}(D)$ for the researcher's hypothesis $H$, $d^{max\mathcal{P}(D)}=MCD(\mathcal{P}(D),H)$.  The researcher, knowing this, chooses the elements of the power set of her data that do not refute her hypothesis.  The data in $D$ omitted from $d^{max\mathcal{P}(D)}$ are either falsifying or probability lowering for $H^{max\mathcal{P}(D)}$, called $d^{fh}$, or have no effect (are non-diagnostic), called $d^{nd}$.
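
The selection of the most convincing dataset can be illustrated with a three-datapoint example.  The likelihood ratios and the prior are hypothetical, and the datapoints are assumed to be conditionally independent given the hypothesis so that their likelihood ratios multiply.  Let $D=\{d_{1},d_{2},d_{3}\}$, let $P(H)=0.5$ (prior odds of 1), and let the likelihood ratios $P(d_{i}|H)/P(d_{i}|\neg H)$ be $2$, $4$, and $1/2$ for $d_{1}$, $d_{2}$, and $d_{3}$ respectively.  The $2^{3}=8$ elements of $\mathcal{P}(D)$ then give:
\begin{center}
\begin{tabular}{lcc}
Shared subset & Posterior odds & Posterior probability of $H$ \\
\hline
$\emptyset$ & $1$ & $0.500$ \\
$\{d_{1}\}$ & $2$ & $0.667$ \\
$\{d_{2}\}$ & $4$ & $0.800$ \\
$\{d_{3}\}$ & $1/2$ & $0.333$ \\
$\{d_{1},d_{2}\}$ & $8$ & $0.889$ \\
$\{d_{1},d_{3}\}$ & $1$ & $0.500$ \\
$\{d_{2},d_{3}\}$ & $2$ & $0.667$ \\
$\{d_{1},d_{2},d_{3}\}$ & $4$ & $0.800$ \\
\end{tabular}
\end{center}
In this example the most convincing dataset is $\{d_{1},d_{2}\}$, with posterior probability $8/9 \approx 0.889$, and the omitted datapoint $d_{3}$ is probability lowering, so it belongs to $d^{fh}$.  Sharing everything would instead yield $0.800$.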

\subsection{The editor is naive}

Consider the case where the editor is naive.  This means two things: 1) that the editor does not know that the shared data does not have to be all of the data, and 2) that the editor does not know that the researcher's payoff is equal to the editor's posterior odds of a true discovery to no discovery. 

With a naive editor, the researcher's choice is then to select hypothesis $H$ and dataset $d$ so as to maximize $P(H=TD|D)_{E}=\frac{\mathcal{A}-x}{\mathcal{A}}$.  This is done simply by finding $H^{max\mathcal{P}(D)}$ and then sharing $d^{max\mathcal{P}(D)}$ and omitting $d^{fh}$ and $d^{nd}$.  This also involves guessing the prior probability of the editor's hypotheses $P(H)_{E}$.

In response, since the editor is naive, the editor merely calculates $P(H^{max\mathcal{P}(D)}=TD|d^{max\mathcal{P}(D)})$:
  
\begin{equation}
  \frac{P(d^{max\mathcal{P}(D)}|H^{max\mathcal{P}(D)}=TD)P(H^{max\mathcal{P}(D)}=TD)}{P(d^{max\mathcal{P}(D)})} = \frac{\mathcal{A}-x}{\mathcal{A}}
\end{equation}

The editor's expected loss compared to seeing the full dataset and maximal hypothesis on that dataset is equal to the researcher's gain for being dishonest.  This can be broken into two cases.  If the entire dataset $D$ contains a falsifier of $H^{max\mathcal{P}(D)}$, then $P(H^{max\mathcal{P}(D)}=TD|D)=0$, and the editor loses $\frac{\mathcal{A}-x}{\mathcal{A}}$ compared to if she had full information from the researcher.  If the entire dataset $D$ does not contain a falsifier, then the editor loses nothing compared to the full information case.
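
The two cases can be combined into a single expected loss.  As a sketch, let $q$ denote the probability that the full dataset $D$ contains a falsifier of $H^{max\mathcal{P}(D)}$ (a symbol introduced here for illustration, not part of the model above):
\[
  E[\text{loss}_{E}] = q \cdot \frac{\mathcal{A}-x}{\mathcal{A}} + (1-q)\cdot 0 = q\,\frac{\mathcal{A}-x}{\mathcal{A}}
\]
The expected loss therefore grows linearly both in how often falsifiers occur and in how convincing the shared subset appears.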

 Thus, omitting falsifiers of the proposed hypothesis from the shared dataset significantly harms the editor's payoff while helping the researcher's payoff.  There is a perverse incentive for the researcher to not share falsifying data.

\subsection{The editor has common knowledge}

If the editor has common knowledge, then she knows that the researcher does not need to share all of the data, and that the researcher's payoff is equal to the editor's posterior odds of a true discovery to no discovery.  If the editor believes the cost to the researcher to collect data to support the proposed hypothesis is zero, then the editor should not trust the data sent by the researcher.  In this case, the editor will simply set the probability of the proposed hypothesis given the shared data equal to her prior probability that the proposed hypothesis is correct:

\begin{equation}
  P(H=TD|D)=P(H=TD)=\frac{\mathcal{A}-x}{\mathcal{A}}
\end{equation}

In general, signals will not be informative if the cost of fabricating evidence is too low (this is called a pooling equilibrium).  Knowing that the pooling equilibrium will occur, the researcher will select a hypothesis that has maximum $P(H)$, and share data that do not refute it.  The editor will judge $H$ based on $P(H)$, and ignore the data.  Thus, the researcher can always fabricate evidence, leading to a pooling equilibrium.  Unless some cost to not sharing data is imposed, one should expect a pooling equilibrium and the learner should ignore the data.
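
The reason the data drop out can be seen from Bayes' rule in odds form.  The following is a sketch, assuming the researcher can produce an equally supportive shared dataset $d$ whether or not $H$ is a true discovery ($\neg TD$ denotes `not a true discovery', notation introduced here for illustration):
\[
  \frac{P(H=TD|d)}{P(H=\neg TD|d)} = \frac{P(d|H=TD)}{P(d|H=\neg TD)} \cdot \frac{P(H=TD)}{P(H=\neg TD)}
\]
If selective sharing is costless, then $P(d|H=TD)=P(d|H=\neg TD)$, the likelihood ratio equals one, and the posterior odds equal the prior odds.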

The conclusion I draw in this section is that if the teacher/researcher can costlessly omit data, then she can always send a knowingly false hypothesis and the editor cannot discriminate this from one where the researcher does not know the hypothesis is false.  If this is true, then the editor should always ignore the data, and judge hypotheses only on the editor's prior probability of that hypothesis $P(H)$.  Thus, the research community must impose some cost of not sharing data to justify learning from and publishing data.

\chapter{Guiding Principles}

\setlength{\epigraphrule}{0pt}
\setlength{\epigraphwidth}{.95\textwidth}
\begin{epigraphs}
  \centering

  \qitem{``But let a man venture into an unfamiliar field, or where his results are not continually checked by experience, and all history shows that the most masculine intellect will ofttimes lose his orientation and waste his efforts in directions which bring him no nearer to his goal, or even carry him entirely astray. He is like a ship in the open sea, with no one on board who understands the rules of navigation. And in such a case some general study of the guiding principles of reasoning would be sure to be found useful.''}
        {---\textsc{Charles Peirce, 1877, The Fixation of Belief \cite{peirce1877fixation}}}
        
\end{epigraphs}

Experiments frequently yield unexpected results.  Learning the most from these results, and communicating them effectively, requires understanding what went wrong.  Without a method that does this, disconfirming surprises will be seen as incomprehensible and diffuse, as indicated by Chapters Four and Five.  Thus, a sound method of identifying the causes of error should both protect against distorted inferences and promote the sharing of valuable but disconfirming data.

The prescriptive analysis is divided into two parts.  First, to learn from disconfirmation, one's hypothesis needs a well-defined \emph{ceteris paribus} clause.  If this is not done, then surprising data will be perceived as being caused by an experiment that could have yielded any results, thus making the data not worth sharing, as indicated by Chapter Four.  Although logically valid, these uniform error models do not allow predictions that can help debug the experiment.  The approach described in Chapter Seven gives form to error, so failed predictions can be pinpointed to specific causes; that is, error models can be made non-uniform.  This is done by elaborating on five \emph{guiding principles} extracted from four important philosophers of science (Popper \cite{popper2002logic}, Kuhn \cite{kuhn1996structure}, Lakatos \cite{lakatos1980methodology}, and Mayo \cite{mayo1996error}).  They are:

\begin{enumerate}
\item Theories must be sufficiently axiomatized (Popper).
\item Tests must be conducted carefully and tenaciously, isolating sources of error (Mayo).
\item An interesting theory should be retained in the face of disconfirmations, as long as it can be modified to make interesting new predictions (Lakatos and Kuhn).
\item Commit to tracking and making probabilistic statements about ``what went wrong'' (Mayo and \cite{ferrucci2010building}).
\item When a competitor theory clearly wins, it may be time to give up on one's own theory and engage in other pursuits (Lakatos and Kuhn).
\end{enumerate} 

This approach provides guidance on how to design and interpret experiments so that unexpected results are seen as informative rather than incomprehensible mere facts that turn into discarded anomalies.  Like Duhem's Simplicism, the approach is ``born and matured in the daily practice of science'' (pg. 3) \cite{duhem1991aim}.

Next, Chapter Eight discusses methods for documenting and communicating data.  Chapter Three proposed that any pedagogical data sharing strategy relies on the sharer knowing a lot about what the learner knows.  Ethical and practical rules for data sharing, discussed in Chapter Two, involve imposing minimal conventions on the reader.  Descriptively, in the Wason rule discovery task of Chapter Five, participants' judgments of error were uncorrelated with actual error and less likely to be shared with another person when financial penalties were absent.  The prescriptive approach described in Chapter Eight proposes a method of carefully documenting the conventions used in collecting and reporting data, allowing better communication and providing the required structure so penalties can be implemented.

\chapter{Breeding Orchids}

\setlength{\epigraphrule}{0pt}
\setlength{\epigraphwidth}{.95\textwidth}
\begin{epigraphs}
  \centering
                \qitem{``The conduct of subtle experiments has much in common with the direction of a theatre performance,'' says Daniel Kahneman, a Nobel-prize winning psychologist at Princeton University in New Jersey. Trivial details such as the day of the week or the colour of a room could affect the results, and these subtleties never make it into methods sections. Bargh argues, for example, that Doyen's team exposed its volunteers to too many age-related words, which could have drawn their attention to the experiment's hidden purpose. In priming studies, ``you must tweak the situation just so, to make the manipulation strong enough to work, but not salient enough to attract even a little attention'', says Kahneman. ``Bargh has a knack that not all of us have.'' Kahneman says that he attributes a special `knack' only to those who have found an effect that has been reproduced in hundreds of experiments. Bargh says of his priming experiments that he ``never wanted there to be some secret knowledge about how to make these effects happen. We've always tried to give that knowledge away but maybe we should specify more details about how to do these things.''}
              {---\textsc{Ed Yong, 2012 \cite{yong2012bad}}}        
        
\end{epigraphs}

Data attributed to error are not shared with others, and these error attributions are unduly affected by whether results are affirming or disconfirming.  Although Chapter Five found that financial penalties for incorrect error attributions helped participants identify error, it is not possible to provide these incentives in the real world, as error attributions are usually an implicit part of research.  Additionally, Chapter Five found that even when penalties promoted substantial accuracy of error attributions, data sharing policies were still inappropriately affected by whether the feedback was affirming or disconfirming.  Furthermore, Chapter Four found that unexpected results are seen as caused by error, and the degree to which this error makes predictions from the same experiment seem uniform is positively associated with decisions to not share data.  This chapter proposes methods of making precise, non-uniform error models, thus protecting data sharing policies by promoting accurate perception of error.

As with all methodologies, there is no attempt at logical proof.  Instead the approach analyzes important problems that lead to experimental surprises, and proposes methods to both avoid and learn from them.  Most of the chapter has no empirical tests.  However, to the greatest extent possible, examples are given by applying the proposed method to the methodology and results described in Chapter Five.

The approach outlined here blends work on scientific discovery by Herbert Simon and colleagues \cite{klahr1988dual,klahr1999studies,schunn1996problem}, work on causal induction by Griffiths, Tenenbaum and colleagues \cite{griffiths2009theory}, work on noisy causal inference by Scheines and colleagues \cite{scheines2005similarity,spirtes2000causation}, and work by various social science methodologists (Rosenthal and Rosnow \cite{rosenthal1991essentials}; Shadish, Cook, and Campbell, \cite{shadish2002experimental}; Luce and Narens \cite{narens1986measurement}).

To explain the approach, consider the process of breeding orchids.  Successful breeding requires a delicate balance of conditions, which may only occur with the care, cleverness, and patience of the breeder, in limited environments, and with specialized tools.  Experimentation in the social sciences often follows a similar process.  In order to study a phenomenon that interests us, investigators labor to find the delicate conditions under which it can be most reliably observed.  They then conduct experiments manipulating theoretically interesting variables, within the constraints of that microcosm.  This is the ``direction of a theatre performance'' articulated by Kahneman \cite{yong2012bad}.

Discovering these delicate conditions can provide great benefits in the opportunities that it creates to replicate basic patterns at will, and to compare results observed in conditions that vary in controlled ways.  However, it also creates the risks of limiting studies to an experimental monoculture, producing fragile results that disappear outside the ``hothouse,'' or represent theories without clear predictions for more complex settings.  Researchers concerned about these risks will invest in testing the boundary conditions for their prized results.  

Within these carefully constructed normal science microcosms there is a partial solution to the problem of deciding whether to share ambiguous, disconfirming data with others: as normal science develops, data become unambiguously informative about the hypotheses of interest.  Thus, the problem of data sharing is closely tied to the problem of managing experimental uncertainty.

Paul Meehl and his associates (e.g., \cite{meehl1990appraising,meehl1997problem,meehl2002path}) laid the foundations for analyzing the process of experimentation, which they conceptualized as the repeated choice and generation of methodological and statistical tools to learn about the core and auxiliary hypotheses of a research program.  A \emph{core hypothesis} is a set of sentences in a suitable formal language, such as first-order logic, coupled with an ontological statement about the elements of the world and how they are related \cite{meehl1990appraising,borsboom2004concept}.  An \emph{auxiliary hypothesis} is a series of sentences, in that formalism, that is used in conjunction with the core hypothesis to derive its observable consequences \cite{meehl1990appraising}.

The proposed framework develops the core and auxiliary hypotheses through a series of five testing stages, each developed to solve a specific epistemic problem.  It is roughly patterned on that of multi-phasic medical clinical trials, whose structure is partially mandated by regulatory requirements, driven by the need to ensure that research results are characterized well enough to allow decisions about their application in matters of life and death. 

\begin{itemize}
\item In Stage 1, a \emph{representation} of the core and auxiliary hypotheses is formed out of one's entire corpus of knowledge.
\item In Stage 2, \emph{pretesting}, the auxiliary hypotheses needed to derive observable predictions from a core theory are tested directly, in ways that allow assessing and reducing their failure probabilities. 
\item In Stage 3, \emph{pilot testing}, the core is assumed to be true and its predictions are tested along with the auxiliary hypotheses known to be imperfect. 
\item In Stage 4, \emph{testing}, the auxiliary hypotheses are assumed to be true and different core hypotheses are tested against each other. 
\item Finally, in Stage 5, \emph{evidence synthesis}, the cumulative weight of evidence is examined, and a decision is made to either return to Stages 1--4 or terminate the endeavor.
\end{itemize}

I lay out these stages, discuss the problems in each stage, and propose methods to solve them.

\section{Stage One: Representation}

The first stage develops and represents the core hypothesis that will be tested along with the auxiliary hypotheses needed to test it.  Rather than reasoning analytically from premises to conclusions (deduction), or inferring general principles from instances (induction), Stage One is \emph{abductive}, generating plausible explanations for phenomena and formulating a precise \emph{ceteris paribus} clause under which the proposed explanation will hold.

\subsection{Core Hypotheses}

Griffiths and Tenenbaum's Theory-Based Causal Induction \cite{griffiths2009theory} provides a method for developing inductive causal theories with structured background knowledge.  The approach works by specifying an \emph{ontology}, a set of \emph{plausible relations} between ontological elements, and precise mathematical \emph{functional forms} that these plausible relations can take, either deterministically or stochastically.  These three elements are all formalized as structured prior knowledge (what Levi \cite{levi1997covenant} calls the corpus of knowledge), making Bayesian computation possible.

\subsubsection{Ontology}

An ontology is the skeleton of any theory.  It is an explicit commitment to the entities that exist in the theory, their properties, and how these different entities (types) relate to one another \cite{schneider2011reasoning}.  The ontology is usually specified in one of a number of ways, including mereology (part-whole relationships) and topology (connectedness relationships) \cite{varzi1998basic}.  For example, one theory may involve the amygdala as a part of the brain (a mereological specification), which is physically connected to the insula (a topological specification).

Theory-Based Causal Induction \cite{griffiths2009theory} uses a simplified version of an ontology.  It involves entities or \emph{types}, each with their own specific properties.  For example, in Newton's mechanics, the relevant types were mass, velocity and acceleration.  The ontology also specifies how many of each type we are considering, or a stochastic distribution of the number of entities of each type.  For example, we may examine two masses colliding (a fixed number) or an unknown or variable number of masses colliding (stochastic).  The next element of the ontology is a set of \emph{predicates} that describe the possible causal relationships between the types, and range of values that each predicate can take (e.g., Boolean, continuous, etc.).  For example, if mass A interacts with mass B through the collision predicate, then the velocity and acceleration of both masses change.  If no predicate relates two entities, then they cannot be directly causally connected.  

\subsubsection{Plausible Relations}

Next are \emph{plausible relations}.  If types are connected by predicates then they are related in some way, but some relationships may be more plausible than others.  For example, ``a lamp is more likely than a fan to produce a spot of light, that a fan is more likely than a tuning fork to blow out a candle, and that a tuning fork is more likely than a lamp to produce resonance in a box'' (pg. 664) \cite{griffiths2009theory}.  The plausibility of the relationship is likely closely related to the presence and strength of mechanistic explanations of how a cause can produce an effect \cite{luhmann2007buckle}.  The set of plausible relationships can be represented as \emph{Directed Acyclic Graphs} (DAGs), or causal graphical models with typed variables as vertices \cite{spirtes2000causation,pearl2000causality}. 

\subsubsection{Functional Form}

Finally, a functional form specifies the exact mathematical relationship between cause and effect described by the plausible relation.  For example, in the case of Newton's mechanics, acceleration, mass and force are related by a specific functional form: $F=ma$.  If outcomes are stochastic, the functional form is a conditional probability density function.  A very simple case, a linear approximation to the noisy-OR function, proposes that the probability of an effect is merely the weighted sum of its independent causes:
\begin{equation}
  P(effect)=w_{0}+w_{1}cause_{1}+w_{2}cause_{2}+...+w_{n}cause_{n}
\end{equation}

In this function, $w_{i}$ is the strength of each cause $i$ to produce the effect.  In more complex problems, the functional form may be as complex as a system of partial differential equations.  
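
For comparison, the noisy-OR function is conventionally written in product form.  As a sketch in the notation above, treating each $cause_{i}$ as a 0/1 indicator (an assumption made here for illustration):
\[
  P(effect) = 1 - (1-w_{0})\prod_{i=1}^{n}(1-w_{i})^{cause_{i}}
\]
Expanding the product and dropping cross terms such as $w_{i}w_{j}$ gives $w_{0}+\sum_{i} w_{i}cause_{i}$, so the weighted sum above agrees with the product form when the weights are small.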

\subsubsection{Example Gene Expression Experiment}

Griffiths and Tenenbaum provide an example of the Theory-Based Causal Induction approach using a gene expression experiment with lab mice.  The ontology, plausible relations, and functional form are summarized in the table below.

\begin{table}[h]
  \begin{tabular}{c c c c}
    \multicolumn{4}{c}{Ontology} \\ \hline
    Type & Number & Predicates & Values \\ \hline
    Chemical & $N_{c} \sim P_{c}$ & \emph{Injected}(Chemical, Mouse) & Boolean: \{T, F\} \\
    Gene & $N_{g} \sim P_{g}$ & \emph{Expressed}(Gene, Mouse) & Boolean: \{T, F\} \\
    Mouse & $N_{m} \sim P_{m}$ & & \\ \hline
\end{tabular}
\end{table}

The plausible relation is that an injection of chemical $C$ into mouse $M$ will lead to an expression of gene $G$ in mouse $M$, and this relationship is expected to be true for all mice $M$ with probability $p$ for each $C,G$ pair:
\begin{equation}
  \text{\emph{Injected}(C, M)} \rightarrow \text{\emph{Expressed}(G, M)}, \text{p } \forall \{C, G\}
\end{equation}

The functional forms specify that the injection of mouse $M$ with chemical $C$ is an exogenous event, determined by a coin flip with known bias (i.e., randomization):
\begin{equation}
  \text{\emph{Injected}(C, M)} \sim \mathrm{Bernoulli}(\cdot)
\end{equation}

The expression of gene $G$ in mouse $M$, on the other hand, is modeled using the noisy-OR function: expression is a coin flip with bias $v$, which is in turn determined both by the base rate of gene expression ($w_{0}$) and by whether the mouse was injected with chemical $C$:
\begin{equation}
  \text{\emph{Expressed}(G, M)} \sim \mathrm{Bernoulli}(v) 
\end{equation}
\begin{equation}
  v=w_{0}+w_{1}\text{\emph{Injected}(C, M)}
\end{equation}
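
As a worked illustration with hypothetical values (chosen only for arithmetic convenience, not taken from any study), let $w_{0}=0.1$ and $w_{1}=0.6$, and code \emph{Injected}(C, M) as 1 when true and 0 when false:
\[
  v = 0.1 + 0.6 \cdot 0 = 0.1 \text{ (not injected)}, \qquad v = 0.1 + 0.6 \cdot 1 = 0.7 \text{ (injected)}
\]
For $v$ to remain a valid Bernoulli parameter under this additive form, the weights must satisfy $w_{0} \geq 0$ and $w_{0}+w_{1} \leq 1$.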
  
\emph{Wason Task}.  Take the Wason task, described in Chapter Five, as a second example.  The core hypothesis in these experiments was that feedback attributed to error would be less likely to be shared.

Formulating the core hypothesis uses a many-sorted logic, meaning there are different \emph{sorts} or \emph{types} of things.  Here there are only two types: participants (P) and triples (TR).  The types are related through multiple predicates:
\begin{itemize}
\item \emph{Feedback}(P, TR): A participant (P) receives feedback (F) on a triple (TR) with Boolean values (\{FIT, DNF\}).
\item \emph{Attributes}(P, F): A participant (P) attributes feedback (F) to error with Boolean values (\{Error, Not Error\}).
\item \emph{Share}(P, TR): A participant (P) shares a triple (TR) with Boolean values (\{Share, No Share\}).
\end{itemize}

Using this typed logic allows us to preclude impossible statements, such as attributing a participant to error.  The ontology is summarized in the table below.

\begin{table}[h]
  \begin{tabular}{c c c c}
    \multicolumn{4}{c}{Ontology} \\ \hline
    Type & Number & Predicates & Values \\ \hline
    Triple & $N_{t} \sim P_{t}$ & \emph{Feedback}(P, TR) &\{FIT, DNF\} \\
    & & \emph{Share}(P, TR) & \{Share, No Share\} \\
    Participant & $N_{p} \sim P_{p}$ & \emph{Attributes}(P, F) & \{Error, Not Error\} \\ \hline 
\end{tabular}
\end{table}

Based on the core hypothesis, the first plausible relation is that the feedback affects the error attribution for each participant $P$ and each triple $TR$ with probability $p_{1}$:
\begin{equation}
  \text{\emph{Feedback}(P, TR)} \rightarrow \text{\emph{Attributes}(P, F)},\text{ }p_{1} \forall \{P, TR\}
\end{equation}

The second plausible relation is that the error attribution affects data sharing for each participant $P$ and each triple $TR$ with probability $p_{2}$:
\begin{equation}
  \text{\emph{Attributes}(P, F)} \rightarrow \text{\emph{Share}(P, TR)}, \text{ }p_{2} \forall \{P, TR\}
\end{equation}

The DAG \cite{gilks1994language,lee2009course,spiegelhalter1998bayesian} for these plausible relations is the structural model that the experimenter (me) is trying to learn:
%\clearpage
\begin{figure}[h]
  \includegraphics[width=0.7\textwidth]{dissdag}
  \caption{DAG of a core hypothesis for the Wason task. $TR$ is a triple, $F$ is feedback, $\epsilon$ is exogenous error, $ATT$ is the attribution of the trial to error, and $SH$ the sharing decision.  The two plates show that judgments are specific to a triple, and triples are specific to a participant.}
\end{figure}

The functional forms are as follows. First the feedback $F$ is jointly determined by the triple ($TR$) and error ($\epsilon$).  The double box around this node indicates that it is completely determined by its parents:  
\[
F = \left\{
\begin{array}{l l}
  FIT & \quad \text{if } (\epsilon=F \text{ and } TR=FIT) \text{ or } (\epsilon=T \text{ and } TR=DNF)\\
  DNF & \quad \text{if } (\epsilon=T \text{ and } TR=FIT) \text{ or } (\epsilon=F \text{ and } TR=DNF)
\end{array} \right.
\]

\begin{equation}
  \epsilon \sim \mathrm{Bernoulli}(0.2)
\end{equation}
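
Taken together, these two definitions imply how reliable the feedback is.  As a sketch, reading $\epsilon \sim \mathrm{Bernoulli}(0.2)$ as $P(\epsilon=T)=0.2$ (an assumption about which outcome the parameter refers to):
\[
  P(F=FIT \mid TR=FIT) = P(\epsilon=F) = 0.8, \qquad P(F=FIT \mid TR=DNF) = P(\epsilon=T) = 0.2
\]
That is, the feedback matches the true status of the triple on 80\% of trials and is flipped on the remaining 20\%.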

The attribution of the feedback to error depends on the feedback, as indicated by the following plausible relation shown in the DAG:
\begin{equation}
  \text{\emph{Feedback}(P, TR)} \rightarrow \text{\emph{Attributes}(P, F)}
\end{equation}

Thus, the attribution is a Bernoulli random variable:

\begin{equation}
  \text{\emph{Attributes}(P, F)} \sim \mathrm{Bernoulli}(\alpha_{p,t}) 
\end{equation}

The parameter $\alpha_{p,t}$ determines the tendency to attribute feedback to error, and can be estimated from the data.  The example below shows the functional form of  $\alpha_{p,t}$ using the logistic likelihood function.  The tendency to attribute feedback to error is a function of the feedback ($\beta(F_{t})$) and a subject-level intercept ($\alpha_{p}$) to account for the fact that some participants are more likely to make (unconditional) error attributions than others:
\begin{equation}
  L(\alpha_{p,t}) = \frac{1}{1+ e^{\beta(F_{t})+\alpha_{p}}} 
\end{equation}
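
It may help to restate this in log-odds form.  As a sketch, if the right-hand side is read as the value of $\alpha_{p,t}$ itself (an interpretive assumption), then rearranging gives:
\[
  \log\frac{\alpha_{p,t}}{1-\alpha_{p,t}} = -\left(\beta(F_{t})+\alpha_{p}\right)
\]
Under the sign convention as written, larger values of $\beta(F_{t})+\alpha_{p}$ therefore lower the probability of an error attribution, and $\alpha_{p,t}=0.5$ when the sum is zero.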

The data sharing judgment follows a similar pattern, but is instead jointly determined by both feedback and attribution, as indicated by the following plausible relations shown in the DAG:

\begin{equation}
  \text{\emph{Attributes}(P, F)} \rightarrow \text{\emph{Share}(P, TR)}
\end{equation}

And:
\begin{equation}
  \text{\emph{Feedback}(P, TR)} \rightarrow \text{\emph{Share}(P, TR)}
\end{equation}

Again, the sharing judgment is a Bernoulli random variable:

\begin{equation}
  \text{\emph{Share}(P, TR)} \sim \mathrm{Bernoulli}(\sigma_{p,t}) 
\end{equation}

The parameter $\sigma_{p,t}$ determines the tendency to share data, and can also be estimated from the data.  The example below shows the functional form of $\sigma_{p,t}$, again using the logistic likelihood function.  The tendency to share trials is a function of the feedback ($\beta_{sh}(F_{t})$), a subject-level intercept ($\sigma_{p}$) to account for participant-level willingness to share, and whether the feedback was attributed to error ($\alpha_{sh}$):
\begin{equation}
  L(\sigma_{p,t}) = \frac{1}{1+ e^{\beta_{sh}(F_{t})+\sigma_{p}+\alpha_{sh}}} 
\end{equation}
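
The pieces above can be assembled into a single joint distribution.  As a sketch, assuming trials are conditionally independent given the parameters, and writing $ATT$ and $SH$ for the attribution and sharing outcomes as in the DAG caption, the model factorizes for each participant $p$ and triple $t$ as:
\[
  P(F, ATT, SH \mid TR) = P(F \mid TR)\, P(ATT \mid F)\, P(SH \mid F, ATT)
\]
Here $P(F \mid TR)$ is fixed by the error rate of $\epsilon$, $P(ATT=\text{Error} \mid F)=\alpha_{p,t}$, and $P(SH=\text{Share} \mid F, ATT)=\sigma_{p,t}$.  The likelihood of a full dataset is then the product of these terms over all participants and triples.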

This concludes the formal representation of the core hypothesis.  Using this approach as a guide can help researchers develop suitably formalized psychological theories, as Popper required  \cite{popper2002logic}.  Using a formal representation is especially helpful in dealing with the problem of changing definitions in response to refutation and blaming a theoretician for misinterpreting a theory, Popper's second and fourth conventionalist stratagems.  If definitions are clearly specified ahead of time in the ontology, then this stratagem requires more rigorous defense.  If our theory is clearly laid out in a standard format, then we guard ourselves against being attacked for theoretical misinterpretation or incoherence.  Some of this may already be done implicitly by social scientists (which Griffiths and Tenenbaum provide evidence of).  It is also important to note that formalizing a theory in this manner should come after, not precede, the content of the theory.  That is, the formalization can ``serve as a means of precise exposition, but not as a guarantee of soundness for the conceptions incorporated in the axiomatized theory'' (pg. 250) \cite{hempel1977formulation}.

\subsection{Auxiliary Hypotheses}

The Theory-Based Causal Induction approach provides a framework for developing hypotheses.  However, this is not enough for those who not only specify theories, but also experimentally test them.  To test any theory, an experimenter must make a variety of assumptions about the experimental test, usually implicit, to fairly test the theory.  This \emph{ceteris paribus} clause is a conjunction of auxiliary hypotheses that allow the core hypothesis to make well-defined predictions.

Unfortunately, there is no general solution to the problem of listing all the relevant auxiliary hypotheses: we have always forgotten something.  Luckily, there are types of auxiliary hypotheses that are used repeatedly, so it is helpful to specify them for every project so they do not have to be reconceptualized from scratch for each experiment.  There are also some auxiliary hypotheses that are common in social science research.

\subsubsection{General Auxiliary Hypotheses}

The general auxiliary hypotheses proposed here are ones that experimenters, of any discipline, encounter.  This typology was designed to include Meehl's \cite{meehl1990appraising} instrumental auxiliaries, theoretical auxiliaries, and experimental auxiliaries, and Mayo's \cite{mayo1996error} experimental models, and data models.  

There are five general auxiliary hypotheses: 
\begin{enumerate}
\item \emph{Ontological}: Have we omitted any element from our ontology?
\item \emph{Choice of evidence}:  Do we have all the relevant evidence?
\item \emph{Interpretation of evidence}: Are the theories that we use to interpret the evidence accurate?
\item \emph{Instrumental}: Do our instruments work as intended?
\item \emph{Experimental}: Is the experiment free from confounds?
\end{enumerate}

\subsubsection{Ontological}

The first general auxiliary hypothesis is that the ontology we've specified in the Theory-Based Causal Induction ``core'' is correct.  Predictions may fail because we've left out an important entity from our ontology.  For example, Newton's first law, that an object will maintain constant velocity unless a force acts upon it, would ostensibly be violated if one omitted atmospheric friction as a force.  This is an ontological problem, closely related to, but more general than, omitted variables.  This is also called problem framing \cite{phillipsbenefit,morgan1990uncertainty}, where one must consider whether one has included all the relevant possibilities, hypotheses, and model structures when quantifying uncertainty.  It has been speculated that severe theoretical failures are due to incorrectly specified ontologies rather than incorrect functional forms \cite{spiegelhalter2011don}.  The ontological auxiliary hypothesis is that we've included the relevant entities and predicates in our ontology, or that those that are omitted do not affect our inferences.

\subsubsection{Choice of Evidence}

The second general auxiliary hypothesis is that, when we've formed our theory and experiment, we've included all relevant evidence and nothing more \cite{phillipsbenefit,bammer2008uncertainty}.  When mustering evidence in support of or against the core hypothesis, we can consider a wide variety of evidence varying in relevance and quality (EPA, 2009) \cite{van2005combining}.  We can include direct empirical evidence, such as direct measurement of the phenomenon we are interested in (e.g., a Randomized Controlled Trial; a laboratory experiment).  We can also include semi-empirical evidence, such as measurement of a phenomenon but under different conditions than desired.  We can also use data from variables that we think are related to entities we are interested in.  If no empirical evidence is available, we can use theory to fill in the gaps.  Finally, if there is no empirical evidence and no relevant theory, we can submit our own insight and opinions as evidence (e.g., a thought-experiment; guesswork).

With a well-defined theory, one can look at each entity, plausible relation, and functional form, and categorize the evidential support, or lack thereof, for each element \cite{fischhoff2006analyzing}.  For example, we might not have an exact value (empirically verified) for the mass of a billiard ball that is involved in a collision, but may use theory about the size of the ball and the density of its material constituents, to make an approximation.  When evidence is missing, this should be carefully noted.  

\subsubsection{Interpretation of Evidence}

The meaning of any observation is not a self-evident truth or axiom.  Instead, every observation requires an interpretive theory of evidence \cite{lakatos1980methodology}, or theoretical auxiliary hypothesis \cite{meehl1990appraising}.  For example, a theory of optics is required to interpret evidence using a microscope.  Theoretical auxiliaries build on the results and derivations of other theories and evidence, hence depend on the strength of that science.  For example, the use of an aggression scale to test a theory of the relationship between aggression and organizational climate has the theoretical auxiliary hypothesis that the scale measures aggression.  If one fails repeatedly to find a predicted relationship between aggression and climate, it might mean that the prediction is wrong or that the scale is invalid.  Any psychometric theory would be an auxiliary of interpretation.

\subsubsection{Instrumentation}

Auxiliary hypotheses of instrumentation are ``the accepted theory of devices of control (such as holding a stimulus variable constant, manipulating its values, or isolating the system with, e.g., a soundproof box or white-noise masking generator) or of observation'' (pg. 110) \cite{meehl1990appraising}.  The instrumental auxiliaries commonly used in psychology, for example, are computers for online surveys and stimulus presentation, and pencils.  A failed computer, a broken pencil, a ripped survey, a typographical error in instructions, \emph{etcetera}, would be failed auxiliary hypotheses of instrumentation.  While the meaning attributed to observations derived from these instruments depends on the auxiliary hypotheses of interpretation, the actual process that was implemented is one of instrumentation.

\subsubsection{Experimentation}

Finally, auxiliary hypotheses of experimentation concern the internal validity of the experiment or the ``experimentally realized conditions'' \cite{meehl1990appraising}.  Examples include the assumption that volunteers and non-volunteers for the experiment do not systematically differ in their responsiveness to experimental treatments (volunteer bias; \cite{rosenthal1975volunteer}), that the task is not too cognitively demanding (\cite{cannell1981research}), and that the stimuli used in the experiment capture the important elements of the constructs they represent (stimulus sampling, \cite{wells1999stimulus}).  These can also include issues in Mayo's experimental models and data models, including sample size and test statistics, protocols, descriptions of materials, and how they are related to possible errors.

\subsection{Specific Auxiliary Hypotheses}

There are also six specific auxiliary hypotheses that social science researchers typically invoke when designing and interpreting experimental evidence.  Some combination of these auxiliary hypotheses is usually discussed in separate textbooks on survey design, research methodology, or psychometrics.  The classification proposed here unifies them with a specific purpose as auxiliary hypotheses, rather than just topics of study in their own right.  I categorize them by the acronym MIMECC: 

\begin{enumerate}
\item M\emph{otivation}: Are participants motivated to behave accurately?
\item I\emph{nternal}: Are causal inferences free from confounds?
\item M\emph{easurement}: Do measurements meet required assumptions?
\item E\emph{xternal}: Do inferences in the sample generalize to the population?
\item C\emph{onstruct}: Are the concepts used valid?
\item C\emph{ommunication}: Do participants and researcher agree on what is expected?
\end{enumerate}            

Motivation is usually addressed by survey researchers to get participants to respond to survey requests \cite{dillman2007mail}.  Internal validity, external validity, and construct validity are canonical parts of introductory social science research methods \cite{shadish2002experimental}.  Measurement is usually addressed as a topic in psychometrics \cite{suppes1963basic,coombs1970mathematical}.  Finally, communication is usually addressed as a separate topic related to risk communication and survey research \cite{fischhoff2011communicating,schwarz1999self}. 

\subsubsection{Motivation}

The first specific auxiliary hypothesis is motivation.  Even with a perfectly logical experiment, if participants don't care about the task then it is not possible to get good data from them.  Motivating participants can be achieved by increasing the benefits of participation, decreasing the costs of participating, establishing trust with the researchers \cite{dillman2007mail}, and giving participants an incentive to respond carefully and truthfully (incentive compatibility or mechanism design) \cite{fudenberg1991game}.  Importantly, tasks that participants find fun are likely to result in high quality, engaged responses \cite{von2006games}.  For example, Fold-it (\url{http://fold.it/portal/}) is a protein folding game that provides very high quality data, from motivated and careful participants, that could be used to discover human discovery strategies \cite{khatib2011algorithm}.  These are sometimes called serious games \cite{michael2005serious,bergeron2006developing,carroll2004beyond,koster2005theory}, gamification \cite{mcgonigal2011reality,reeves2009total}, or games with a purpose \cite{von2006games}.  See Dillman \cite{dillman2007mail} for more on motivating participants in surveys and Pink \cite{pink2010drive} for motivation more generally.  

\begin{itemize}
\item Benefits: Are the benefits enough to motivate participants?
\item Costs: Are the costs of participation so high as to demotivate participants?
\item Trust: Do the participants trust the researchers?
\item Incentive Compatibility: Are participants incentivized to respond truthfully?
\item Fun: Do participants enjoy performing the task?
\end{itemize}

\emph{Wason Task}.  In the rule-discovery task of Chapter Five, participants were offered \$5 (Benefits) for 30 minutes of their time (Costs).  They were given an informed consent document that specified the research was from a university (Trust).  Participants were not offered money for accurate answers until Experiment Three (Incentive Compatibility).  No assessment was made as to whether the task was fun (Fun).

\subsubsection{Internal}

The second category of specific auxiliary hypothesis is internal validity.  This is by far the largest category, and the most extensively studied.  Whenever an experiment is conducted, one conjectures the auxiliary hypothesis that no other factors aside from our intervention vary between the experimental and control groups.  Random assignment is the first step in doing this: those who are randomly assigned to control and treatment group are expected to be equivalent, in the long run, on every variable they could differ on.  However, if the participants know the condition they are in, or know the hypothesis of the experiment, then they would differ.  As a result it is important that participants and researchers are blinded to both the condition the participant is in and the hypothesis of the experiment.  

Another severe problem of internal validity is incomplete outcome data.  Some participants may not complete the experiment.  If the process that leads to missing data is random, then participants in the treatment and control group will, in the long run, not differ on other variables.  However, if there is a systematic tendency for some participants to drop out in a way that is related to the effect one is trying to measure, then bias will result. 
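
This distinction can be stated a little more formally.  The following is only a sketch, and the notation is introduced here for illustration: let $T_i$ denote the condition of participant $i$, $Y_i$ the outcome, and $R_i = 1$ if the participant completes the experiment (and $R_i = 0$ otherwise).  Missingness is random in the sense used above when completion does not depend on condition or outcome:
\begin{equation*}
P(R_i = 1 \mid T_i, Y_i) = P(R_i = 1).
\end{equation*}
Under this condition, $E[Y \mid T = t, R = 1] = E[Y \mid T = t]$ for each condition $t$, so the comparison among completers,
\begin{equation*}
E[Y \mid T = 1, R = 1] - E[Y \mid T = 0, R = 1],
\end{equation*}
estimates the same quantity as the comparison among all assigned participants.  If instead $P(R_i = 1 \mid T_i, Y_i)$ varies with $Y_i$, the two comparisons can differ, which is the bias described above.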

It is also possible that experimental interventions have subtle side-effects.  For example, the mere novelty of the intervention (i.e., putting the participant in a new situation) or knowledge that the participant is in a study (the Hawthorne effect), can cause differences between groups. See Shadish, Cook and Campbell \cite{shadish2002experimental} for more on threats to internal validity.  

\begin{itemize}
\item Assignment: Was the assignment to condition adequately (randomly) generated?
\item Condition Blinding: Was the method of condition assignment adequately concealed?
\item Hypothesis Blinding: Was unnecessary knowledge of the assigned condition adequately prevented during the study?
\item Incomplete Outcome Data: Were incomplete outcome data adequately addressed?
\item Researcher Expectancies: Has the researcher intentionally or unintentionally influenced the participants?
\item Novelty: Is the treatment intrusive and novel?
\item Disruption: Is the treatment disruptive?
\item Compensatory Rivalry: Does the control group know of the treatment group and try to outperform them?
\item Resentful Demoralization: Does the control group know of the treatment group and perform worse as a result?
\item Treatment Diffusion: Is the control group partially or fully exposed to the treatment?
\item Instrumentation: Are measurement instruments stable over time?
\item Testing: Does repeated measurement affect the measurements?
\item Instability: Is the measurement process reliable over time?
\item Selection: Are there systematic differences in respondent characteristics between groups?
\item History: Has an event occurred between treatment and measurement?
\item Maturation: Are results affected by the procession of time, such as boredom?
\item Regression: Are units selected for extremeness and thus regress to their mean?
\item Attrition: Have respondents been lost to treatment or measurement?
\item Selection-Interaction: Are participants in different treatment groups differentially exposed to internal validity threats?
\item Testing Interactions: Are there reactive measures where asking a question changes the response to the question?
\item Identification: Does each causal factor lead to a different distribution of data?
\end{itemize}

\emph{Wason Task}.  In the rule-discovery task of Chapter Five, random assignment was automated by Qualtrics (Assignment) and this automation prevented me from knowing which participant received what treatment (Condition Blinding).  I was not blinded to the hypotheses, and it is unclear whether the participants were (Hypothesis Blinding).  Some participants did not complete the task (Incomplete Outcome Data and Attrition).  It is possible that the development of the materials in the study unintentionally conveyed the purpose of the study to participants or that expectancies affected statistical analyses, as they were not blinded (Researcher Expectancies).  It is not clear how novelty or disruption effects would apply in this context (Novelty/Disruption).  Although participants were blinded to alternative conditions, they were not blinded to their own conditions (in Experiments Two and Three).  If they somehow were told by a friend the contents of the task, then several threats are possible (Treatment Diffusion, Resentful Demoralization, and Compensatory Rivalry).  The instruments were computerized so should not have degraded, unless I made a programming mistake (Instrumentation).  Having participants make error attributions may subsequently affect their response to data sharing decisions, where if the participant had not explicitly made the attribution there may have been no such association (Testing).  The measurement process was stable over time as it was completely computerized, unless Qualtrics broke (Instability).  Participants were randomly assigned so there would be no selection (Selection).  I know of no events that occurred either after the beginning of the entire research programme or after each participant began the survey itself (History).  Participants could get tired, frustrated, or bored with the task, as it was long and difficult (Maturation).  Participants were not selected for extremeness (Regression). 

\subsubsection{Measurement}

Any empirical study involves measurement, and the underlying formal structure of these measurements determines the type of inferences that can be made from them.  Thus, auxiliary hypotheses of measurement propose that the mathematical assumptions needed to represent our data are correct.  The important elements of this auxiliary hypothesis are representation, uniqueness, meaningfulness, and scaling.

Representation specifies the conditions under which we can create a mathematical (numerical) model of our measurements. For example, two bushels of wheat added to three bushels of wheat results in five bushels of wheat in exactly the same way that $2+3=5$ \cite{suppes1963basic}.  In this case, the arithmetic operator addition holds true for bushels of wheat.  If empirical relationships hold in a manner that is identical to a set of numerical relationships, then the latter is a mathematical representation of the former.  This is called an isomorphism.  
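
The wheat example can be written out as a sketch, using symbols that are introduced here only for illustration.  Let $A$ be the set of quantities of wheat, let $\succsim$ denote the empirical comparison ``is at least as much as,'' and let $\circ$ denote the empirical operation of combining two quantities.  A numerical assignment $\phi : A \to \mathbb{R}$ represents this system if, for all $a, b \in A$,
\begin{align*}
a \succsim b &\iff \phi(a) \geq \phi(b), \\
\phi(a \circ b) &= \phi(a) + \phi(b).
\end{align*}
When $\phi$ counts bushels, combining a quantity with $\phi(a) = 2$ and a quantity with $\phi(b) = 3$ yields $\phi(a \circ b) = 5$, which is the sense in which addition holds true for bushels of wheat.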

Uniqueness is the degree to which the numbers we assign to observations can be exchanged for other numbers while preserving the properties of the representation.  For example, a transitive (ordered) scale is preserved under any monotonic function.  Interval scales are unique only up to an affine transformation, which preserves both order and distances between items in the ordered sets. The set of transformations that preserve the relational properties of the system (e.g., ordering, fixed differences) is the set of admissible transformations for the measurement system \cite{coombs1970mathematical}.

Meaningfulness delimits the valid assertions that can be made based on the representation and uniqueness of a measurement.  The conclusions we draw should not change if we make an admissible transformation to our numbers.  Any statement that does not change based on admissible transformations is meaningful; those that do change are not meaningful.  
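
A short worked example may make the link between admissible transformations and meaningfulness concrete.  It is a sketch with illustrative notation and an illustrative temperature example, neither of which is taken from the cited sources.  For an interval scale, the admissible transformations are the positive affine maps
\begin{equation*}
\phi'(x) = \alpha\,\phi(x) + \beta, \qquad \alpha > 0.
\end{equation*}
A comparison of differences is meaningful on such a scale, because every difference is multiplied by the same positive constant:
\begin{equation*}
\phi'(a) - \phi'(b) = \alpha\,\bigl(\phi(a) - \phi(b)\bigr),
\quad\text{so}\quad
\phi(a) - \phi(b) > \phi(c) - \phi(d) \iff \phi'(a) - \phi'(b) > \phi'(c) - \phi'(d).
\end{equation*}
A statement about ratios is not meaningful, because $\beta$ does not cancel.  For instance, $20^{\circ}$C is numerically twice $10^{\circ}$C, but under the admissible transformation $F = 1.8\,C + 32$ the same two temperatures are $68^{\circ}$F and $50^{\circ}$F, whose ratio is $1.36$.  The claim ``twice as hot'' therefore changes with the choice of scale, while the ordering of the two temperatures does not.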

Finally, scaling concerns the practical transformation of measurements into numerical scales, with possible errors.  DeVellis \cite{devellis2011scale} argues that items sharing a common cause are a scale; if they share a common consequence, they are an index; and if they are just part of a superordinate category, they are an emergent variable. 

\begin{itemize}
\item Representation: What are the relational properties of the measurement system (e.g., transitivity, completeness)?
\item Uniqueness: What is the set of homomorphisms or isomorphisms equivalent to our measurement system?
\item Meaningfulness: What transformations can we apply to our data while preserving their meaning?
\item Reliability: Are measurements test-retest, inter-rater, and internally reliable?
\item Stability: Are measurements stable to split-half, parallel forms, and alternative-forms reliability?
\end{itemize}

\emph{Wason Task}.  In the rule-discovery task of Chapter Five, error attributions, feedback, and data sharing decisions were all nominal variables, whereas probability judgments were supposed to be absolute (Representation).  Any one-to-one transformation is admissible for a nominal variable \cite{coombs1970mathematical}, but there are no admissible transformations to probability judgments if they are interpreted as absolute probabilities (Uniqueness, Meaningfulness).  There was no attempt at measuring reliability (test-retest, interrater, internal) and stability (split-half, parallel forms, alternative forms) of measurement instruments (Reliability and Stability).

\subsubsection{External Validity}

External validity specifies the difference between the experiment we've conducted in its ideal form and the real world circumstance to which we intend to generalize.  External validity is important because if we choose an experimental situation that differs from the world we are trying to extrapolate to, then we may find no effect in the lab when there is an effect in the real world, or vice versa.  To help with this, we'd like to randomly sample from the population we are interested in extrapolating to, giving them the same intervention that they would receive in the real world, and measuring the actual outcome of interest, rather than a proxy or surrogate.  See Turner \emph{et al.} \cite{turner2009bias} for a careful definition of external validity.  For issues of recruitment, see \cite{treweek2010strategies,edwards2009methods,prescott1999factors,watson2006increasing}.

\begin{itemize}
\item Population: Are study subjects in the idealized study drawn from a population identical to the target population with respect to age, sex, etc.?
\item Intervention: Is the intervention in the idealized study identical to the intervention used?
\item Control: Is the control group in the idealized study the same as the control group used?
\item Outcome: Is the study outcome the same as the idealized study outcome?
\end{itemize}

\emph{Wason Task}.  In the rule-discovery task of Chapter Five, participants in the study were either CMU students or MTurk participants, and thus were not drawn from the population of interest (working scientists; Population).  The interventions used in the study were not similar to those experienced by real scientists, either in terms of incentives (\$5 or \$100 is nothing like tenure or a pharma job) or penalties (earning less than \$5 is nothing like being fired, e.g., Stapel) (Intervention).  The data sharing outcome was nothing like a publication decision (Outcome). 

\subsubsection{Construct Validity}

Construct validity is probably the most difficult to understand and most vaguely defined auxiliary hypothesis in all of social science.  Construct validity is a statement about an observed pattern of correlations in data and unobserved causes or latent variables proposed by our theory \cite{cronbach1955construct}.  This is different from criterion validity which relates a pattern of correlations among observed or operationally defined measurements.  It also differs from content validity because ``no criterion or universe of content is accepted as entirely adequate to define the quality to be measured'' (pg. 282) \cite{cronbach1955construct}.  

Construct validity is epistemic and empirical rather than ontological and theoretical \cite{borsboom2004concept}. If our measures do not correlate with what they are supposed to, they do not have convergent validity.  If they are correlated with things they aren't supposed to be related to, then they do not have discriminant validity.  If either convergent or discriminant validity is violated, then construct validity is poor.

\begin{itemize}
\item Convergent: Does the construct correlate well with other constructs it should be related to?
\item Discriminant: Is the construct uncorrelated with constructs it shouldn't be related to?
\item Necessity: Are there any dimensions contained in the constructs that are not contained in the measures?
\item Sufficiency: Are there any dimensions contained in the measures that are not contained in the construct?
\item Construct Stability: Does the dimensionality of the measures change across treatment conditions?
\item Method Stability: Would the fit of the measures to the construct change if conducted using a different method (format)?
\item Level Stability: Is the level of the treatment administered large enough to produce an effect?
\item Process Accuracy: When responding to the measures, are participants using the same process that the construct supposes?
\end{itemize}

\emph{Wason Task}.  In the rule-discovery task of Chapter Five, the constructs of error attribution and data sharing were not validated.  There was no attempt to correlate them with other judgments that should be related (Convergent Validity) or with other judgments that shouldn't (Discriminant Validity). 

\subsubsection{Communication}

The most well understood but overlooked auxiliary hypothesis is communication: that the participant and experimenter agree on the meaning of the experimental instructions and expected behaviors of the participant.  This can fail in a number of ways.  If instructions are written or read aloud, participants may not have the cognitive capability or reading ability to comprehend the instructions.  If they are asked to perform a task that is complex, they may not be able to integrate and use the information provided in the task as required.  Participants may gloss over key instructions if they are not made salient and the participant's attention is not drawn to them.  They also may not encode instructions even if they briefly attend to them.  

There are also problems with the language games played whenever natural language communication is involved. If participants have only basic knowledge or values, they will construct responses to specific questions that don't reflect what they actually believe or want \cite{fischhoff1991value}.  If excess information is conveyed, participants may feel that they are supposed to use the information, and if too little information is conveyed, they may try to fill in the gaps with what they think the researcher wants \cite{schwarz1999self}.  If participants are deceived they may not find the researcher trustworthy and not attend to information or use it in the expected manner.  See Fischhoff, Brewer, and Downs \cite{fischhoff2011communicating} for guidance on drafting high quality communications.

\begin{itemize}
\item Comprehension: What is the reading level of the instructions?  What is the propositional content of the instructions?
\item Attention: Do participants perceive the relevant communications?  Do they register relevant information in long-term memory?
\item Use: Do participants understand how to act on communicated information?  Do they use prospective memory effectively, using information when they need to?
\item Quantity (a): Is too little information communicated to participants?
\item Quantity (b): Is too much information communicated to participants?
\item Quality: Is false or unsubstantiated information conveyed to participants?
\item Manner: Is information communicated to participants ambiguous, prolix, or disorganized?
\end{itemize}

\emph{Wason Task}.  In the rule-discovery task of Chapter Five, the instructions were all modified to be at a 6th to 8th grade reading level (Comprehension).  Questions were asked both during and at the end of the task to make sure they understood and attended to the instructions (Attention), and whether they understood how to use the information (Use).  Attempts were made to avoid deceiving or omitting any information (Quantity a,b; Quality); however, too little information was provided about the probability judgments (Quantity a).  Information was communicated in a concise way, although some participants found it sterile (Manner). 

\subsection{Paradigmatic Auxiliary Hypotheses}
Finally, there are two auxiliary hypotheses that are necessary for the construction of a paradigm: 1) sparsity of effects and 2) ideal interventions.  

\subsubsection{Sparsity of Effects}

Any paradigm involves a fixed set of background factors that do not vary, and a set of interventions or experimental factors that do \cite{schunn19954}.  The auxiliary hypotheses of experimentation discussed previously deal with the possibility that the factor that varies is not the only one that varies between treatment and control group.  However, the sparsity of effects assumption is the complement to this: that the background set of factors that do not vary have no special effect on the outcome of the experiment.  This is the assumption that we haven't created a syzygy, where the effect of our experimental manipulation is really the product of particular quirks of the paradigm, quirks which are not part of the theory we are using that interact with the experimental manipulation.  Sparsity of effects for paradigms assumes that items not manipulated are not causing higher order interactions.
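
One way to make this assumption explicit is with a simple linear sketch.  The model and its symbols are illustrative assumptions rather than part of the theory being tested: let $T$ be the manipulated factor, $Z$ a background factor that the paradigm holds fixed at some value $z_0$, and $Y$ the outcome, with
\begin{equation*}
Y = \beta_0 + \beta_1 T + \beta_2 Z + \beta_3 T Z + \varepsilon .
\end{equation*}
Within the paradigm, the observed effect of a one-unit change in $T$ is $\beta_1 + \beta_3 z_0$, and the experiment alone cannot separate the two terms because $Z$ never varies.  Sparsity of effects amounts to assuming $\beta_3 \approx 0$, so that the observed effect reflects $\beta_1$ rather than a quirk of the particular value $z_0$ at which the background factor happened to be fixed.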

\emph{Wason Task}.  In the rule-discovery task of Chapter Five, the sparsity of effects auxiliary hypothesis is that items not manipulated are not causing higher order interactions, for example that the labeling of FIT/DNF is not causing higher order interactions with other variables, such as tendencies to choose triples that are expected to be affirming or disconfirming.

\subsubsection{Ideal Intervention}

Finally, experimental manipulations are formally entailed by changes to causal graphical models \cite{pearl2000causality}.  For example, if we want to determine whether B has a causal effect on A, then if we manipulate B directly, all possible causes of B are eliminated, rendering it exogenous.  This auxiliary hypothesis is that the manipulation of B is ideal, in the sense that if we try to change the state of B it responds perfectly, not stochastically, and that our intervention does not affect any other factors involved in the system we are investigating.  However, there are a number of ways interventions can go awry.  For example, an intervention may be noisy, and only produce the desired effect some of the time.  An intervention may also cause other effects than just the one we desire.  Scheines \cite{scheines2005similarity} shows that choosing an ideal intervention is the same as choosing an instrumental variable.  This can be seen as the construct validity of the intervention and is typically validated using a manipulation check.
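
The requirements on an ideal intervention can be written out as a sketch.  The intervention variable $I$ and the error rate $\epsilon$ are notation introduced here for illustration, and the $do(\cdot)$ operator is used in the sense of the causal graphical models cited above.  The causal question is whether
\begin{equation*}
P\bigl(A \mid do(B = b)\bigr) \neq P\bigl(A \mid do(B = b')\bigr) \quad \text{for some } b \neq b' .
\end{equation*}
An intervention $I$ on $B$ is ideal when (i) it fully determines $B$, so that $P(B = b \mid I = b) = 1$ and the other causes of $B$ no longer matter; and (ii) it influences $A$, and every other variable in the system, only through $B$.  The two failures described above correspond to violations of these conditions: a noisy intervention has $P(B = b \mid I = b) = 1 - \epsilon$ with $\epsilon > 0$, and an intervention with side effects has a path from $I$ to $A$ that does not pass through $B$.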

\emph{Wason Task}. In the rule-discovery task of Chapter Five, the interventions were the feedback, incentives and penalties.  No manipulation checks were used for the feedback or penalties, but there was a manipulation check for the incentives using an open-ended response.

\subsubsection{Creating and Representing a Paradigm}

Stage One lays the formal foundation of core and auxiliary hypotheses needed for theory testing.  From the above analyses we have a formalized theory along with auxiliary hypotheses that are nominally classified as being relevant or irrelevant to the task.  Once these are carefully articulated, it is exciting to get to the critical experiment.  However, failed predictions from this premature experiment usually indicate that Stage One provides nowhere near the necessary preparation to perform normal science.  We don't know if the auxiliary hypotheses are actually relevant, how likely they are to fail, and what failure effects they may have.  While Stage One helps protect against the need for invoking auxiliary hypotheses ad-hoc, or misusing a theory (Popper's first and fourth conventionalist stratagems), more is needed.

Researchers are all familiar with the concepts of pretests, pilot tests, and experimental tests.  However, the distinctions among them are not typically made sharply in their training nor documented systematically in research reports.  As a result, there is greater risk of a fortuitous result from a pretest being ``promoted'' to the status of an experimental test or, conversely, an experimental test being ``demoted'' to a pilot test when it produces unexpected (or unwanted) results.  In order to take full advantage of pleasant or unpleasant surprises, experimenters need a systematic empirical approach to dealing with auxiliary hypotheses, so that their validity is neither over- nor understated.  To do this, we clarify the following four stages of testing, each with their intended purpose and assumptions.

\section{Stage Two: Pre-testing}

The second stage, \emph{pre-testing}, addresses the problem of getting information from experiments when we are uncertain both about whether our core hypothesis is true or false and whether our experimental design satisfies the necessary auxiliary hypotheses we've laid out.  For example, most experiments make the auxiliary hypothesis that participants understand the instructions provided to them.  A pre-test would present these instructions and then follow with a quiz to test comprehension.  If participants can successfully introspect about their understanding, there should be predictable differences in responses among those who do and do not interpret the instructions as intended. Testing that assumption about introspection regarding the instructions requires a separate experiment, with its own complications, perhaps reduced by the strength of the general science regarding those issues.  

Thus, the goal of pretesting is to collect data to assess and minimize the risk of failed auxiliary hypotheses.  It helps solve the problem of choosing the experimental design that yields true auxiliary hypotheses (e.g., the participant understands the instruction), or minimizes the chance that the auxiliary hypothesis will be false.  

The proposed method works as follows.  Suppose one is concerned that a participant does not understand some concept communicated in the instructions.  By performing a cognitive interview, or giving the participant a quiz, one can estimate the individual-level failure probability of that communication \cite{vose2008risk}. 

\emph{Wason Task}.  Here is an example taken from my pretesting of the Wason rule-discovery task used in Chapter Five.  First, I considered the specific assumptions required to test my core hypothesis using \emph{pre-posterior analysis}.  Pre-posterior analysis is an important kind of suppositional reasoning \cite{levi1996sake}, where one supposes that an outcome occurred and then entertains the set of serious possible causes of that outcome.  In this way, one pre-empts the data by making causal attributions beforehand.  The intent is to minimize the regret of not having considered a threat to the validity of our experiment beforehand.

By imagining cases where I got unexpected results, I came up with 16 auxiliary hypotheses (listed below) that I felt were necessary to give a proper test to my core hypothesis.  As can be seen, they all focus on the comprehension auxiliary hypothesis:  

\begin{enumerate}
\item Participant does not understand how the error works.
\item Participant doesn't understand the Actual Rule concept. 
\item Participant doesn't understand the triple concept.
\item Participant doesn't understand the Your Rule concept.
\item Participant doesn't understand their task in general.
\item Participant doesn't understand what FIT/DNF means.
\item Participant doesn't understand what the feedback means.
\item Participant doesn't understand what it means to create a new triple to test Your Rule.
\item Participant doesn't have a hypothesis that they believe. Participants don't know how to confirm or disconfirm their hypothesis.
\item Participant doesn't understand how to change the error attributions.
\item Participant doesn't understand how to record error on the spreadsheet.
\item Participant doesn't understand how to use the spreadsheet.
\item Participant doesn't understand that evidence is disconfirming or confirming.
\item Participant doesn't understand that they only get one guess.
\item Participant doesn't understand the guessing.
\item Participant doesn't understand what the error attribution task is.
\end{enumerate}

Cognitive interviews were then conducted to examine whether these auxiliary hypotheses were satisfied by the experimental design.  This was done using think-aloud protocols, where a participant is asked to ``think-aloud'' while interpreting the instructions.  They were also asked to respond to retrospective probes \cite{willis1999cognitive,ericsson1980verbal}.  The participants were sixteen members of the Pittsburgh community recruited through a web advertisement through the Center for Behavioral Decision Research.  They were paid \$5 for 30 minutes of their time.

The list below shows the retrospective probes that were intended to examine the participants' understanding of the task. These were asked after the participant completed the think-aloud portion of the interview.

\begin{enumerate}
\item Can you explain, in your own words, what the ``Error'' is and its purpose in the task?
\item On each trial, how likely is it that an error will occur?
\item Can you explain, in your own words, what the ``Actual Rule'' is and its purpose in the task?
\item Can you explain, in your own words, what the ``triple'' is and its purpose in the task?
\item Can you explain, in your own words, what ``Your Rule'' is and its purpose in the task?
\item Can you explain, in your own words, the purpose of the task, in general?
\item Can you explain, in your own words, what ``FIT'' is and its purpose in the task?
\item Can you explain, in your own words, what ``Does not fit'' is and its purpose in the task?
\item Can you explain, in your own words, what the ``feedback'' is and its purpose in the task? 
\item Can you explain, in your own words, what the purpose of the ``new triple'' is in the task? 
\item How much do you believe your hypothesis?
\item Can you explain, in your own words, what it means to ``change your mind about which trials you received false feedback.''
\item Can you explain, in your own words, how to record error on the spreadsheet?
\item Can you explain, in your own words, how to use the spreadsheet in general?
\item Can you explain, in your own words, what ``Fit'' or ``Does not fit'' means for Your Rule?
\item How many chances do you have to guess?
\item Can you explain, in your own words, what ``guessing'' is and its purpose in the task?
\item Can you explain, in your own words, what the ``Error'' is and its purpose in the task?
\end{enumerate}

Figures 7.1--7.3 show the failure rates for the sixteen auxiliary hypotheses for three iterations of the instructions in three sessions.  The blue squares are posterior predictions of the failure probability, along with 95\% credible intervals.  The red squares are the actual (observed) failure probabilities.  Each interview session included a series of 4 participants.  After each session, the instructions were revised to reduce the probability of failure for each auxiliary hypothesis.  A $\mathrm{Beta}(1, 3)$ distribution was taken as the prior distribution for each failure rate, where the parameters of $\mathrm{Beta}(F, S)$ count failures to comprehend ($F$) and successful comprehensions ($S$).
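
As a minimal worked sketch of this updating, assume (for illustration only; this is not necessarily the exact computation behind the figures) that a single auxiliary hypothesis is scored as failed for $f$ of the $n=4$ participants in a session, and write $\theta$ for its failure probability, a symbol introduced here.  With the $\mathrm{Beta}(1, 3)$ prior, conjugate updating gives
\[
\theta \mid f \sim \mathrm{Beta}(1+f,\; 3+n-f), \qquad \mathrm{E}[\theta \mid f] = \frac{1+f}{4+n}.
\]
For example, the hypothetical outcome $f=1$ yields $\mathrm{Beta}(2, 6)$ with posterior mean $2/8=0.25$, while $f=0$ yields $\mathrm{Beta}(1, 7)$ with posterior mean $1/8=0.125$.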

Overall, for any participant, the failure of at least one auxiliary hypothesis among the sixteen was very likely.  Across the three iterations, 43 of the 48 (90\%) observations fell within the 95\% credible interval, indicating slight overconfidence but overall good calibration.  It is important to note from these figures that after only 4 iterations of the materials, I was able to obtain reasonably well-calibrated 95\% credible intervals, with observed failure probabilities falling within the posterior predictive intervals roughly 90\% of the time.  Between the fourth and sixth iterations, the accuracy of point estimates also greatly increased.

\begin{figure}[h]
 \centering 
  \includegraphics[width=1\textwidth]{iteration4.png}
  \caption{Posterior predictions and observed failure rates for iteration 4 of the Wason task design, Experiment One.}
\end{figure}

\begin{figure}[h]
\centering
  \includegraphics[width=1\textwidth]{iteration5.png}
  \caption{Posterior predictions and observed failure rates for iteration 5 of the Wason task design, Experiment One.}
\end{figure}

\begin{figure}[h]
  \centering
  \includegraphics[width=1\textwidth]{iteration6.png}
  \caption{Posterior predictions and observed failure rates for iteration 6 of the Wason task design, Experiment One.}
\end{figure}

From the figures one can see that there was a consistent and intractable failure of auxiliary hypotheses 8 (the meaning of creating a new triple), 10 (changing their error attributions at the end of the task), and 15 (their final answer).  The process not only determined that these auxiliary hypotheses are a problem, but also produced calibrated failure probabilities for each auxiliary hypothesis.  By knowing the probability of failure, it is possible to adjust inferences appropriately, as will be explored in the evidence synthesis approach described in Stage Five.

\subsection{Acceptance Sampling}

Given a list of auxiliary hypotheses (specifications) and the ability to repeatedly adjust the design of the experiment to try to minimize their failure risks, pre-testing is in a form that is amenable to industrial design techniques, such as statistical quality control.  For example, suppose we have designed an experiment with a specific set of instructions, questions, \emph{et cetera}, as was done in the Wason example above.  We have also agreed, provisionally, on ways of measuring the success or failure of each auxiliary hypothesis (e.g., open-ended responses on a cognitive interview).  We now want to choose a sample size of participants to measure the auxiliary hypotheses, where if few enough failures occur we would accept the experimental design and move to Stage Three, but if too many failures occur, we redesign the experiment hoping to reduce the failure risks.  This sequential testing process is equivalent to the acceptance sampling used in industrial design and statistical quality control.

Here is a simple example using literacy and motivation.  A participant given instructions roughly between 6th and 8th grade reading level performs reasonably well on both open-ended and quiz-based tests of comprehension of the instructions.  Thus, the readability of the instructions brings the comprehension auxiliary hypothesis within the tolerance level.  If we test 10 participants with these instructions, and only 1 of them fails to understand them, then we can apply acceptance sampling to get an idea of the risk of failures if we expand our examination more generally and accept those instructions as our experimental method.  This is the problem of choosing a sample size to assure a specific quality of the product.  Although the set of possible risks is logically infinite, in practice it tends to be quite small, that is, detectable in about 20 participants \cite{morgan2001risk}.
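
To make the acceptance-sampling idea concrete, consider a hypothetical single-sampling plan.  The plan and the symbols $n$, $c$, and $p$ are illustrative assumptions, not values fixed by the discussion above: test $n=10$ participants and accept the instructions if at most $c=1$ of them fails to comprehend.  If each participant fails independently with probability $p$, the probability of accepting the design is
\[
P_{\mathrm{accept}}(p) = \sum_{k=0}^{c} \binom{n}{k} p^{k} (1-p)^{n-k} = (1-p)^{10} + 10\,p\,(1-p)^{9}.
\]
Under these assumptions, a design with true failure risk $p=0.1$ is accepted with probability of about $0.74$, whereas a design with $p=0.3$ is accepted with probability of about $0.15$.  Choosing $n$ and $c$ therefore amounts to choosing how often one is willing to accept a poor design or reject a good one.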

Additionally, it is possible to estimate the risks of new, previously unconsidered auxiliary hypotheses.  These can be modeled by a Poisson or exponential distribution.  The key is that eventually the surprises will converge to zero: we may come up with at most 20 or 30 auxiliary hypotheses before we stop noticing surprises.

By making auxiliary hypotheses concrete, specific, and measurable, it is possible to create a precisely designed product that is a scientific experiment.  The eventual aim of Stage Two is to manufacture parts of an experiment that meet specified auxiliary hypotheses exactly, just as the parts of each modern car are produced almost exactly the same (within some small margin of error) \cite{shewhart1939statistical}.

\section{Stage Three: Pilot-Testing}

Although Stage Two measures and reduces the risk of failed auxiliary hypotheses, we can never be sure our measurements are right.  Suppose, for example, that we thought a question about the participant's comprehension of the task was a valid measure.  To test that, however, would require a separate experiment to validate the measure.  It is easy to see that testing our tests leads to an infinite regress.  Thus, once we settle on tests we consider reasonable and, according to those tests used in Stage Two, our auxiliary hypotheses are measured to have low risk of being violated, we can conduct an experiment to see if we are right.

Stage Three does this.  Instead of infinitely testing our tests, as would be required if we were committed to staying in Stage Two, Stage Three fuses the auxiliary hypotheses with the core hypothesis to make experimental predictions.  However, the key feature of Stage Three that differentiates it from Stage Four is that it uses Lakatos' \emph{negative heuristic} \cite{lakatos1980methodology}: if something goes wrong with the predictions, we consider that our Stage Two pre-testing didn't reveal all the problems with our auxiliary hypotheses; that is, we protect our core hypothesis and direct refutation at the fallible auxiliary hypotheses.  

Why is this stage necessary, since we can directly test auxiliary hypotheses in Stage Two?  First, we may not have thought of all necessary auxiliary hypotheses.  Directly testing them in Stage Two does not guarantee that we find all important auxiliary hypotheses, but instead helps us estimate the failure probabilities of ones we've already considered.  Second, there are limitations to the methodologies that can be employed in Stage Two (e.g., cognitive interviews).  By using experimental manipulation instead of interviews and other measurements, and assuming that the core hypothesis is correct, one can draw the conclusion that a failed prediction in an experiment is an indication of a lurking omitted auxiliary hypothesis or a failed auxiliary hypothesis we have already considered.  The form of the failed prediction can help us generate a plausible auxiliary hypothesis to explain it.  That is, Stage Three uses failed predictions to generate ideas about possible ways our auxiliary hypotheses may have failed and possible ways of satisfying them.  New auxiliary hypotheses generated in this way are part of what Mayo calls the \emph{error repertoire} \cite{mayo1996error}.

\emph{Wason Task}.  In Experiment One there was an unexpected lack of association between data sharing and both the feedback and whether the trial was attributed to error.  The open-ended answers strongly suggested that participants didn't understand or pay attention to this part of the task.  Sharing judgments were at the end of the task when participants were ready to quit.  Participants also indicated that they wanted to propose new trials or state trials rather than triples.  Their confusion with the data sharing measure was not considered beforehand, and the maturation threat was considered but not taken seriously enough.

Experiment Two, in contrast, did find a strong relationship between error attributions, feedback, and data sharing.  However, the response format that was used left two doubts in my mind, both stemming from a possible testing artifact: making the error attribution immediately before the sharing judgment may itself cause some association between the two.  First, since True and Share were both buttons on the left side of the screen, and False and Do not share on the right side, any tendency to merely click left or right would make a strong association between the two judgments.  Second, since they were right after each other on the task, the participant may have inferred that they should be related, a demand effect.  There was also no effect of the incentive on their scores, attribution patterns, or data sharing, even though the participants worked substantially harder.

Experiment Three used both trial-by-trial and end-of-task data sharing judgments.  While this doesn't account for the order effect or demand effect, it does allow comparison of end-of-task judgments to trial-by-trial judgments, the former being less susceptible to these two artifacts.  However, differences between trial-by-trial and end-of-task judgments could be caused both by differences in bias and by differences in strategy: participants do not know their final hypothesis during the task, making their sharing judgments ambiguous, but do know their final hypothesis at the end of the task, making a sharing strategy more worth developing.  Thus, Experiment Three does not account for these artifacts of trial-by-trial judgments.  In addition, participants were given either the perverse or the compatible incentive to share, giving them motivation both to work toward finding a specific hypothesis (one that is convincing) and to share specific data that are consistent with this hypothesis.  The data were exactly opposite of what I predicted: those in the perverse incentive condition tended to share all of their data.  This was an interesting anomaly that fit readily into neither the specific auxiliary hypotheses mentioned in Stage One nor those generated from the preposterior analysis in Stage Two.  Instead I was completely dumbfounded.

More generally, although I kept track of the mistakes people generally make (i.e., the specific auxiliary hypotheses), the mistakes I expect (from the preposterior analysis), and the mistakes I've previously made (defined implicitly in the experimental design that avoids the mistakes or explicitly in what Mayo calls the \emph{error repertoire}), the errors I identified in all three experiments were ones that I hadn't considered beforehand.  This could be because generating a new explanation is more interesting than attributing failed predictions to old errors (from the specific auxiliary hypotheses, etc.), or because new explanations seem more convincing once new data are found (the hindsight that was expected, but not found, in the Experimental Surprises research).

\section{Stage Four: Testing}

In Stage Four, the testing phase, the researcher is confident enough in her auxiliary hypotheses that she is willing to admit falsification of the core hypothesis if that is what the data suggest.  After extensive pre-testing and pilot-testing in Stages Two and Three, we come to an experimental design that we believe has low enough risk of failed auxiliary hypotheses that if a failed prediction occurred we would be willing to discard our hypothesis rather than invoke an auxiliary hypothesis to explain the failure.

There are two requirements to get to Stage Four testing.  The \emph{less} important requirement is that one is confident in one's auxiliary hypotheses.  The \emph{more} important requirement is that one is willing to admit falsification of the core hypothesis under \emph{any} circumstance.  If this is not the case, then the core hypothesis is not appropriate for scientific investigation, for example, if it is `bubba psychology' \cite{mcguire1973yin}.  This occurs when experiments are ``not to test our hypotheses but to demonstrate their obvious truth'' for hypotheses that are ``so clearly true (given the implicit and explicit assumptions).''  If this is the case, then
\begin{quote}
``what the experiment tests is not whether the hypothesis is true but rather whether the experimenter is a sufficiently ingenious stage manager to produce in the laboratory conditions which demonstrate that an obviously true hypothesis is correct. In our graduate programs in social psychology, we try to train people who are good enough stage managers so that they can create in the laboratory simulations of realities in which the obvious correctness of our hypothesis can be demonstrated.''
\end{quote}

To be able to reach Stage Four testing, one must ask oneself at the outset of the research endeavor whether one would be willing to admit falsification of the core hypothesis given a perfect test.  If not, then Stage Four testing can never occur. 

As Stage Four testing is what most researchers have in mind when they develop methodological and statistical tools, most of the mainstream tools developed for the social sciences are applicable. This is because Stage Four testing assumes that one has already debugged the experiment, and that the only thing that could result in a failed prediction of a theory is that the theory is wrong, and an alternative is correct.  This is the typical assumption used in most mainstream statistical approaches. 

\subsection{Wason Task}
The Wason Task did not reach this stage.  After reflecting on whether I would be willing to give up any of my core hypotheses, I think I would not be willing to give up the core hypotheses that feedback and error attributions determine data sharing.  I believe these to be true.  Thus, they are poor core hypotheses.  However, the core hypotheses that these data sharing policies are either normatively justified or effective are ones on which I am willing to accept any answer.

\section{Stage Five: Evidence Synthesis}
Unfortunately, if one completes the previous four stages, one will likely have a series of experiments that differ in complicated ways.  At best, this is a nonstationary stochastic process.  At worst, the experiments may seem completely unrelated to each other.  How can one make sense of all these data?  Is there a point where the `warm-up' period of Stages Two and Three ends and the `real science' of Stage Four begins?

The common problem is this: one completes a series of experiments and finally one finds a set of auxiliary hypotheses where the core hypothesis makes a correct prediction.  One has made a discovery.  How should the previous experiments weigh on the judgment of the validity of the final discovery?  If they are seen as related, they should cast doubt on our final result, requiring additional replications before firm conclusions can be drawn. If they are seen as unrelated, then the final experiment can be treated as the first member of its class.  Similarly, if one is compelled to report these experiments to the scientific community, how far back should one go?  To pilot-tests, pre-tests, or even thought experiments? 

This is a very difficult problem.  It is a typical criticism of meta-analysis: the studies are unique, different, and thus not exchangeable.  Treating them as exchangeable in that case is seen as flawed reasoning rather than a reasonable assumption.

The solution I sketch here can be called \emph{Generalized Meta-Analysis} (GMA).  It is general enough to take into account arbitrary relationships between core hypothesis, auxiliary hypotheses, and experiments, including probabilistic risk analysis of auxiliary hypotheses.  It can implement the Theory-Based Causal Induction framework directly.  The approach uses Hierarchical Bayesian Models (HBMs) in the stochastic programming language Church \cite{goodman2008church}.\footnote{An important warning should be made before beginning discussion.  The construction of the HBM should not interfere with critical and dynamic thinking.  What I describe here is merely a formalized representation and way of performing computations on what one already believes.  One should not rigidly adhere to it or think that the numbers are in any sense correct any more than one's subjective beliefs are correct.}  

The approach works as follows.  The hypothesis under examination is called the core hypothesis and is evaluated as a stochastic program along with auxiliary hypotheses that are required for it to make predictions about each experiment.  While the core hypothesis remains the same over the experiments, the auxiliary hypotheses need not remain the same.  Each set of auxiliary hypotheses allows its experiment to be interpreted in light of the core hypothesis.  The core hypothesis makes probabilistic predictions for each experiment conditional on the auxiliary hypotheses.

There are three difficulties with this approach: 1) knowing the appropriate set of auxiliary hypotheses for each experiment, 2) knowing how each auxiliary hypothesis modifies the predictions of the core hypothesis, and 3) knowing how likely each of the auxiliary hypotheses is to be met conditional on the experimental design.  Although empirical evidence may be available on each of these three difficulties, the approach has no direct answer.  Instead, it allows the researcher to make formal guesses and evaluate the consequences of these guesses in light of other data.

\emph{Wason Task}.  To make the method clear, consider an example from the Wason task in Chapter Five.  In this task, I proposed a core hypothesis that there was some correlation between the receipt of disconfirming feedback and judgment that the feedback was error.  This correlation can be called phi ($\phi$), which is a correlation coefficient for a $2\times2$ binary contingency table.  Thus, my core hypothesis is that $\phi$ is positive, most likely above $0.3$. This can be modeled as a scaled beta distribution, with $2*\mathrm{Beta}(a, b)-1$ covering the interval $[-1, 1]$, the range of $\phi$.  The core hypothesis proposes a specific distribution, parameterized by \{a,b\}, that is consistent with $\phi$ being above $0.3$.  Suppose, based on intuition, we set the hypothesis with parameters $a=7$, $b=3$, giving a mean $\phi$ of $2*0.7-1=0.4$.  If we were to take this to be our naive model of the data (that is, assuming all auxiliary hypotheses are met), then we can take the observed sampling distribution of $\phi$ and compare it to our model to yield both a likelihood and posterior distribution of our hypothesis.
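
The scaled beta model can be written out explicitly.  The following is a sketch that uses only the definitions above; the symbol $X$ for the underlying beta variable and $B(a,b)$ for the beta function are introduced here for convenience.  If $X \sim \mathrm{Beta}(a, b)$ and $\phi = 2X-1$, then
\begin{align*}
p(\phi \mid a, b) &= \frac{1}{2\,B(a,b)} \left(\frac{1+\phi}{2}\right)^{a-1} \left(\frac{1-\phi}{2}\right)^{b-1}, \qquad -1 \le \phi \le 1, \\
\mathrm{E}[\phi] &= \frac{2a}{a+b} - 1 = \frac{a-b}{a+b}, \\
\mathrm{Var}[\phi] &= \frac{4ab}{(a+b)^{2}(a+b+1)}.
\end{align*}
With $a=7$ and $b=3$ this gives $\mathrm{E}[\phi]=0.4$ and $\mathrm{Var}[\phi]=84/1100\approx 0.076$, that is, a standard deviation of roughly $0.28$.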

However, suppose we have some estimated risk of violating an auxiliary hypothesis about communication.  Suppose we think that if this auxiliary hypothesis is violated, then participants will not understand what they are expected to do, and will respond randomly.  Thus, if the auxiliary hypothesis is violated, we would expect $\phi$ to be distributed uniformly over the interval $[-1, 1]$.  Suppose that, based on our measurements in Stage Two, we found that of four participants interviewed, three participants understood the instructions, and one did not, so our posterior estimate of the probability of this auxiliary hypothesis being violated is distributed $\mathrm{Beta}(1, 3)$, assuming improper $\mathrm{Beta}(0, 0)$ priors (we could also use Jeffreys' invariance prior or an informative prior if we want).  Now, suppose our second experiment uses a different set of instructions which we estimated reduced this risk to $\mathrm{Beta}(1, 9)$; that is, out of 10 participants, only 1 failed to comprehend the instructions.   

To integrate the two experiments together with the core hypothesis, all we need to do is examine the following cases.  If our hypothesis is true and the auxiliary hypothesis is true in both experiments, then our hypothesis makes predictions $2*\mathrm{Beta}(7, 3)-1$ for $\phi$ for both experiments.  The expected value is $\phi=0.4$.  However, this is weighted by the probability that the auxiliary hypotheses are true for both experiments, which is the product of the two beta random variables (two coin flips) which is, on average, equal to $3/4*9/10=27/40$.  Next, it is possible that the communication auxiliary hypothesis was violated for the first experiment, but not the second.  This would happen, in expectation, with probability $1/4*9/10=9/40$.  Likewise, it is possible that the communication auxiliary hypothesis was violated for the second experiment and not the first; this would happen, in expectation, with probability $3/4*1/10=3/40$.  Finally, the auxiliary hypothesis could be violated in both experiments with probability $1/4*1/10=1/40$.

As a result, if our core hypothesis is true, it makes the following predictions for the two experiments.  With probability $27/40$, it predicts $\phi \sim 2*\mathrm{Beta}(7, 3)-1$ for both experiments.  With probability $9/40$, it predicts $\phi \sim U[-1, 1]$ for the first experiment and $\phi \sim 2*\mathrm{Beta}(7, 3)-1$ for the second.  With probability $3/40$ it predicts $\phi \sim 2*\mathrm{Beta}(7, 3)-1$ for the first experiment and $\phi \sim U[-1, 1]$ for the second experiment.  Finally, with probability $1/40$ it will predict $\phi \sim U[-1, 1]$ for both experiments.

What does this tell us?  In general, the predictions of our hypothesis should be more diffuse than if we were naive, as the predictions are mixed with the uniform distribution for $13/40$ of cases, in expectation.
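
The case analysis above can be summarized as a mixture.  As a sketch, write $f(\phi)$ for the density of $2*\mathrm{Beta}(7, 3)-1$ and $u(\phi)=1/2$ for the uniform density on $[-1, 1]$ (both symbols are introduced here), and replace each violation probability by its expected value, $1/4$ for the first experiment and $1/10$ for the second.  Assuming that the two experiments are independent given the core hypothesis $H$, the predictive densities are
\begin{align*}
p(\phi_{1} \mid H) &= \tfrac{3}{4}\, f(\phi_{1}) + \tfrac{1}{4}\, u(\phi_{1}), \\
p(\phi_{2} \mid H) &= \tfrac{9}{10}\, f(\phi_{2}) + \tfrac{1}{10}\, u(\phi_{2}), \\
p(\phi_{1}, \phi_{2} \mid H) &= p(\phi_{1} \mid H)\, p(\phi_{2} \mid H).
\end{align*}
Expanding the product gives four terms, one for each of the cases listed above.  The prediction for the first experiment is more diffuse than the prediction for the second because it places more weight on the uniform component.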

Now, what if we want to `debug' our experiment, and pinpoint which of two non-equivalent auxiliary hypotheses was more likely to be violated?  Suppose that in both experiments we also had an alternative auxiliary hypothesis that participants inferred from the order of the error judgment and feedback that they should be highly related.  This would create a $\phi$ coefficient much higher than we would otherwise expect, say $\phi \sim 2*\mathrm{Beta}(19, 1)-1$.  Just as with the other auxiliary hypothesis, in Stage Two pretesting we found that the probability of violating this hypothesis was distributed $\mathrm{Beta}(2, 1)$ for the first experiment and $\mathrm{Beta}(1, 4)$ for the second.  That is, the first experiment used methods that were more likely to be susceptible to this order effect than the second.

Using this, we can derive the following qualitative analysis.  If $\phi$ is very high in experiment one, this is likely due to the second auxiliary hypothesis (no order effects) being violated.  However, if it is very high in experiment two, we are more likely to expect that our theory was correct.  Negative $\phi$ coefficients for either experiment strongly suggest that participants didn't understand the instructions, although this is more true for the first than for the second experiment.  Importantly, the degree to which we update our core hypothesis will depend on the risk of these auxiliary hypotheses.
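
This qualitative analysis can be illustrated with a rough calculation.  The following sketch makes several simplifying assumptions that are not part of the model above: the core hypothesis is taken to be true with $\phi \sim 2*\mathrm{Beta}(7, 3)-1$, the communication auxiliary hypothesis is ignored, each violation probability is replaced by its expected value ($2/3$ for experiment one and $1/5$ for experiment two), and the observed value $\phi=0.8$ is hypothetical.  Writing $x=(1+\phi)/2=0.9$, the likelihood ratio of an order effect to no order effect is
\[
\frac{p(\phi \mid \text{order effect})}{p(\phi \mid \text{no order effect})}
= \frac{19\,x^{18}}{252\,x^{6}(1-x)^{2}} \approx \frac{2.85}{1.34} \approx 2.13 .
\]
Multiplying by the prior odds of violation gives posterior odds of about $2 \times 2.13 = 4.26$ in experiment one (a probability of about $0.81$) and about $0.25 \times 2.13 \approx 0.53$ in experiment two (a probability of about $0.35$).  Under these assumptions, the same high $\phi$ is attributed mainly to the order effect in the first experiment and mainly to the core hypothesis in the second.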

We can derive posterior probabilities for our core hypothesis if we have specific alternatives in mind by setting some values for \{a,b\} in the $\phi \sim 2*\mathrm{Beta}(a, b)-1$ distribution.  Alternatively, if we do not have any in mind, we can simulate them by drawing random \{a,b\} values of equal total to the one we consider (e.g., $a+b=10$).  That is, we first draw a value for $a \sim U[0, 10]$ and then calculate $b=10-a$.  By doing this repeatedly for $N$ models, we can create $N$ random alternative hypotheses to the one we've considered.  The prior probability of any of these hypotheses will be Dirichlet distributed.  This procedure can be extended to arbitrary numbers of theories, auxiliary hypotheses, and experiments.  However, computation slows down rapidly.
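
One consequence of this sampling scheme can be stated directly.  As a sketch under the assumptions just described ($a \sim U[0, 10]$ and $b=10-a$), the mean prediction of each randomly drawn alternative is
\[
\mathrm{E}[\phi \mid a] = \frac{2a}{a+b} - 1 = \frac{a}{5} - 1 \sim U[-1, 1],
\]
so the central predictions of the alternatives are spread evenly over the whole range of $\phi$.  Writing $H_{0}$ for the core hypothesis and $H_{1}, \ldots, H_{N}$ for the simulated alternatives (indices introduced here for illustration), the posterior probability of the core hypothesis then follows from Bayes' rule:
\[
P(H_{0} \mid \phi_{1}, \phi_{2}) = \frac{P(H_{0})\, p(\phi_{1}, \phi_{2} \mid H_{0})}{\sum_{i=0}^{N} P(H_{i})\, p(\phi_{1}, \phi_{2} \mid H_{i})} .
\]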

Each of the following examples uses the same simple setup.  The priors for the auxiliary hypotheses are the same in experiment one, about 50\% as estimated from the Stage Two pretesting ($\mathrm{Beta}(1, 1)$).  After experiment one, we've refined the methods, and now our Stage Two pretesting of experiment two reveals an estimate of each auxiliary hypothesis being true 75\% of the time ($\mathrm{Beta}(3, 1)$).

\subsection{Example 1: Two Disconfirmations}

Naively, we can consider two disconfirmations, with $\phi_{1}=0.1$ and $\phi_{2}=0.1$ in the two experiments.  As seen in the table below, the generalized meta-analysis using the Metropolis--Hastings algorithm tells us that our core hypothesis is actually slightly more likely after receiving two `disconfirmations' ($\phi_{1}=\phi_{2}=0.1$). XX

\begin{table}[h]
  \begin{tabular}{c c c c}
    Hypothesis & Likelihood & Prior & Posterior \\ \hline
    Naive Core Hypothesis & $2*\mathrm{Beta}(14, 6)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    GMA Core Hypothesis & $2*\mathrm{Beta}(14, 6)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 1 Exp 1 & $U[-1, 1]$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 2 Exp 1 & $2*\mathrm{Beta}(19, 1)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 1 Exp 2 & $U[-1, 1]$ & $\mathrm{Beta}(3, 1)$ & ?? \\
    Auxiliary 2 Exp 2 & $2*\mathrm{Beta}(19, 1)-1$ & $\mathrm{Beta}(3, 1)$ & ??\\ \hline
\end{tabular}
\end{table}

\subsection{Example 2: Confirmation and Disconfirmation}

In example two, confirmation is observed in Experiment One ($\phi_{1}=0.4$) and disconfirmation is observed in Experiment Two ($\phi_{2}=0$). xx

\begin{table}[h]
  \begin{tabular}{c c c c}
    Hypothesis & Likelihood & Prior & Posterior \\ \hline
    Naive Core Hypothesis & $2*\mathrm{Beta}(14, 6)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    GMA Core Hypothesis & $2*\mathrm{Beta}(14, 6)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 1 Exp 1 & $U[-1, 1]$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 2 Exp 1 & $2*\mathrm{Beta}(19, 1)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 1 Exp 2 & $U[-1, 1]$ & $\mathrm{Beta}(3, 1)$ & ?? \\
    Auxiliary 2 Exp 2 & $2*\mathrm{Beta}(19, 1)-1$ & $\mathrm{Beta}(3, 1)$ & ??\\ \hline
\end{tabular}
\end{table}

\subsection{Example 3: Disconfirmation and Confirmation}
In example three, disconfirmation is observed in Experiment One ($\phi_{1}=0$) and confirmation is observed in Experiment Two ($\phi_{2}=0.4$). xx

The relationship is not symmetric.  A naive meta-analysis would give the same overall result with $\phi_{1}=0$ and $\phi_{2}=0.4$ as with the reverse.  However, GMA allows us to weight the better second experiment more, thus yielding more confirmation of our hypothesis than in the reverse case.  This is because we believe that auxiliary hypothesis one was likely violated in Experiment One.

\begin{table}[h]
  \begin{tabular}{c c c c}
    Hypothesis & Likelihood & Prior & Posterior \\ \hline
    Naive Core Hypothesis & $2*\mathrm{Beta}(14, 6)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    GMA Core Hypothesis & $2*\mathrm{Beta}(14, 6)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 1 Exp 1 & $U[-1, 1]$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 2 Exp 1 & $2*\mathrm{Beta}(19, 1)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 1 Exp 2 & $U[-1, 1]$ & $\mathrm{Beta}(3, 1)$ & ?? \\
    Auxiliary 2 Exp 2 & $2*\mathrm{Beta}(19, 1)-1$ & $\mathrm{Beta}(3, 1)$ & ??\\ \hline
\end{tabular}
\end{table}

\subsection{Example 4: Two Confirmations}
In example four, confirmation is observed in Experiment One ($\phi_{1}=0.4$) and confirmation is observed in Experiment Two ($\phi_{2}=0.4$).

We can see that, although disconfirmation of experiment one doesn't harm our hypothesis, it also doesn't help much once we have confirmation from Experiment Two.  This is again in contrast to a naive approach which would have much stronger results with two confirmations than one.

\begin{table}[h]
  \begin{tabular}{c c c c}
    Hypothesis & Likelihood & Prior & Posterior \\ \hline
    Naive Core Hypothesis & $2*\mathrm{Beta}(14, 6)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    GMA Core Hypothesis & $2*\mathrm{Beta}(14, 6)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 1 Exp 1 & $U[-1, 1]$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 2 Exp 1 & $2*\mathrm{Beta}(19, 1)-1$ & $\mathrm{Beta}(1, 1)$ & ?? \\
    Auxiliary 1 Exp 2 & $U[-1, 1]$ & $\mathrm{Beta}(3, 1)$ & ?? \\
    Auxiliary 2 Exp 2 & $2*\mathrm{Beta}(19, 1)-1$ & $\mathrm{Beta}(3, 1)$ & ??\\ \hline
\end{tabular}
\end{table}

From this setup, we can condition on any observed $\phi$ values for Experiments One and Two and derive the posterior probability of the auxiliary hypotheses being violated in either or both experiments.  With this formalism, we can then determine what modifications to make to our experiment in Stage Three when our predictions are disconfirmed.  Interesting patterns can emerge, where negative results have no effect on the belief in our core hypothesis, or even increase our belief.

\section{Conclusion}

Using the Theory-Based Causal Induction approach of Griffiths and Tenenbaum \cite{griffiths2009theory} we can create carefully constructed theories that satisfy Popper's requirement of sufficient axiomatization for falsification.  We also defend against two conventionalist stratagems: modifying our ostensive definitions and blaming the theoretician.  Additionally, by using a standardized set of both general and specific auxiliary hypotheses, along with additional ones generated for the specific topic, we satisfy Mayo's requirement for a precise ceteris paribus clause and error repertoires.  This sets us up for the severe testing necessary in Stages Two, Three and Four.

Stage Two provides a novel approach to pre-testing.  By systematically imagining scenarios that would lead our predictions to be false (preposterior analysis) along with consideration of the general and specific auxiliary hypotheses from Stage One, we can design experiments to iteratively test, refine, and estimate the risk of these auxiliary hypotheses failing.  Using acceptance sampling and statistical quality control techniques, we can choose sample sizes for each iteration of the experimental design to estimate this risk without wasting resources.

Stage Three presents pilot-testing.  It assumes Lakatos' negative heuristic and directs empirical refutation at the auxiliary hypotheses.  Failed experiments are used to try to pinpoint the cause of the failure.  If the cause was one of the auxiliary hypotheses considered in Stage Two, we can use the Generalized Meta-Analysis to pinpoint the auxiliary hypotheses with the highest posterior probability of failing.  If the cause was not one of the auxiliary hypotheses in Stage Two, strange error patterns will emerge and this will encourage us to consider new auxiliary hypotheses and return to Stage Two to estimate their failure risk. 

Once we are quite sure that our auxiliary hypotheses are met, and our core hypothesis successfully predicts experimental results, we can compare it to an important alternative hypothesis in Stage Four.  At this point, the failure of the hypothesis indicates strong reason to reject it.  This is Mayo's severe test.

Finally, the evidence synthesis approach, Stage Five, allows the researcher to deal with the problem of `warm-up' experiments.  Those risky experiments will be weighted properly by this scheme.  Disconfirming evidence will do more to indicate that an auxiliary hypothesis was violated than that the core hypothesis was false.

\begin{epigraphs}
  \centering
  \qitem{``Although the number of works upon Methodeutic since Bacon's Novum Organum has been large, none has been greatly illuminative. Bacon's work was a total failure, eloquently pointing out some obvious sources of error, and to some minds stimulating, but affording no real help to an earnest inquirer. THE book on this subject remains to be written; and what I am chiefly concerned to do is to make the writing of it more possible.''}
{---\textsc{Charles Peirce, 1931, The Collected Works, Vol 2, 109 \cite{peirce1931collected}}}

\qitem{``When I was young, no remark was more frequent than that a given method, though excellent in one science, would be disastrous in another. If a mere aping of the externals of a method were meant, the remark might pass. But it was, on the contrary, applied to extensions of methods in their true souls. I early convinced myself that, on the contrary, that was the way in which methods must be improved; and great things have been accomplished during my life-time by such extensions. I mention my early foreseeing that it would be so, because it led me, in studying the methods which I saw pursued by scientific men, mathematicians, and other thinkers, always to seek to generalize my conception of their methods, as far as it could be done without destroying the forcefulness of those methods. This statement will serve to show about how much is to be expected from this part of my work.''}
{---\textsc{Charles Peirce, 1931, The Collected Works, Vol 2, 110 \cite{peirce1931collected}}}
\end{epigraphs}








